commit number;commit ID;author name;committer name;message;URL;message_preprocessed;Decision;Rationale;Supporting Facts 0;C_kwDOACN7MtoAKDk3NGY0MzY3ZGQzMTVhY2MxNWFkNGE2NDUzZjgzMDRhZWE2MGRmYmQ;Michal Hocko;Andrew Morton;"mm: reduce noise in show_mem for lowmem allocations While discussing early DMA pool pre-allocation failure with Christoph [1] I have realized that the allocation failure warning is rather noisy for constrained allocations like GFP_DMA{32}. Those zones are usually not populated on all nodes very often as their memory ranges are constrained. This is an attempt to reduce the ballast that doesn't provide any relevant information for those allocation failures investigation. Please note that I have only compile tested it (in my default config setup) and I am throwing it mostly to see what people think about it. [1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de [mhocko@suse.com: update] Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz [mhocko@suse.com: fix build] [akpm@linux-foundation.org: fix it for mapletree] [akpm@linux-foundation.org: update it for Michal's update] [mhocko@suse.com: fix arch/powerpc/xmon/xmon.c] Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz [akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c] Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/974f4367dd315acc15ad4a6453f8304aea60dfbd;mm: reduce noise in show_mem for lowmem allocations;yes;yes;no 0;C_kwDOACN7MtoAKDk3NGY0MzY3ZGQzMTVhY2MxNWFkNGE2NDUzZjgzMDRhZWE2MGRmYmQ;Michal Hocko;Andrew Morton;"mm: reduce noise in show_mem for lowmem allocations While discussing early DMA pool pre-allocation failure with Christoph [1] I have realized that the allocation failure warning is rather noisy for constrained allocations like GFP_DMA{32}. Those zones are usually not populated on all nodes very often as their memory ranges are constrained. This is an attempt to reduce the ballast that doesn't provide any relevant information for those allocation failures investigation. Please note that I have only compile tested it (in my default config setup) and I am throwing it mostly to see what people think about it. [1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de [mhocko@suse.com: update] Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz [mhocko@suse.com: fix build] [akpm@linux-foundation.org: fix it for mapletree] [akpm@linux-foundation.org: update it for Michal's update] [mhocko@suse.com: fix arch/powerpc/xmon/xmon.c] Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz [akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c] Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/974f4367dd315acc15ad4a6453f8304aea60dfbd;"While discussing early DMA pool pre-allocation failure with Christoph [1] I have realized that the allocation failure warning is rather noisy for constrained allocations like GFP_DMA{32}";no;yes;yes 0;C_kwDOACN7MtoAKDk3NGY0MzY3ZGQzMTVhY2MxNWFkNGE2NDUzZjgzMDRhZWE2MGRmYmQ;Michal Hocko;Andrew Morton;"mm: reduce noise in show_mem for lowmem allocations While discussing early DMA pool pre-allocation failure with Christoph [1] I have realized that the allocation failure warning is rather noisy for constrained allocations like GFP_DMA{32}. Those zones are usually not populated on all nodes very often as their memory ranges are constrained. This is an attempt to reduce the ballast that doesn't provide any relevant information for those allocation failures investigation. Please note that I have only compile tested it (in my default config setup) and I am throwing it mostly to see what people think about it. [1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de [mhocko@suse.com: update] Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz [mhocko@suse.com: fix build] [akpm@linux-foundation.org: fix it for mapletree] [akpm@linux-foundation.org: update it for Michal's update] [mhocko@suse.com: fix arch/powerpc/xmon/xmon.c] Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz [akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c] Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/974f4367dd315acc15ad4a6453f8304aea60dfbd;" Those zones are usually not populated on all nodes very often as their memory ranges are constrained";no;no;yes 0;C_kwDOACN7MtoAKDk3NGY0MzY3ZGQzMTVhY2MxNWFkNGE2NDUzZjgzMDRhZWE2MGRmYmQ;Michal Hocko;Andrew Morton;"mm: reduce noise in show_mem for lowmem allocations While discussing early DMA pool pre-allocation failure with Christoph [1] I have realized that the allocation failure warning is rather noisy for constrained allocations like GFP_DMA{32}. Those zones are usually not populated on all nodes very often as their memory ranges are constrained. This is an attempt to reduce the ballast that doesn't provide any relevant information for those allocation failures investigation. Please note that I have only compile tested it (in my default config setup) and I am throwing it mostly to see what people think about it. [1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de [mhocko@suse.com: update] Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz [mhocko@suse.com: fix build] [akpm@linux-foundation.org: fix it for mapletree] [akpm@linux-foundation.org: update it for Michal's update] [mhocko@suse.com: fix arch/powerpc/xmon/xmon.c] Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz [akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c] Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/974f4367dd315acc15ad4a6453f8304aea60dfbd;"This is an attempt to reduce the ballast that doesn't provide any relevant information for those allocation failures investigation";yes;yes;no 0;C_kwDOACN7MtoAKDk3NGY0MzY3ZGQzMTVhY2MxNWFkNGE2NDUzZjgzMDRhZWE2MGRmYmQ;Michal Hocko;Andrew Morton;"mm: reduce noise in show_mem for lowmem allocations While discussing early DMA pool pre-allocation failure with Christoph [1] I have realized that the allocation failure warning is rather noisy for constrained allocations like GFP_DMA{32}. Those zones are usually not populated on all nodes very often as their memory ranges are constrained. This is an attempt to reduce the ballast that doesn't provide any relevant information for those allocation failures investigation. Please note that I have only compile tested it (in my default config setup) and I am throwing it mostly to see what people think about it. [1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de [mhocko@suse.com: update] Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz [mhocko@suse.com: fix build] [akpm@linux-foundation.org: fix it for mapletree] [akpm@linux-foundation.org: update it for Michal's update] [mhocko@suse.com: fix arch/powerpc/xmon/xmon.c] Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz [akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c] Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/974f4367dd315acc15ad4a6453f8304aea60dfbd;" Please note that I have only compile tested it (in my default config setup) and I am throwing it mostly to see what people think about it.";yes;yes;no 2;C_kwDOACN7MtoAKGJmMzk4MGM4NTIxMmZjNzE1MTJkMjdhNDZmNWFhYjY2ZjQ2Y2EyODQ;Suren Baghdasaryan;Andrew Morton;"mm: drop oom code from exit_mmap The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details). The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2]). Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped. The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed. Remove the oom_reaper from exit_mmap which will make the code easier to read. This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true. [1] 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") [2] 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") [3] a213e5cf71cb (""mm/munlock: delete munlock_vma_pages_all(), allow oomreap"") [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization] Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: ""Kirill A . Shutemov"" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bf3980c85212fc71512d27a46f5aab66f46ca284;mm: drop oom code from exit_mmap;yes;no;no 2;C_kwDOACN7MtoAKGJmMzk4MGM4NTIxMmZjNzE1MTJkMjdhNDZmNWFhYjY2ZjQ2Y2EyODQ;Suren Baghdasaryan;Andrew Morton;"mm: drop oom code from exit_mmap The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details). The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2]). Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped. The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed. Remove the oom_reaper from exit_mmap which will make the code easier to read. This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true. [1] 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") [2] 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") [3] a213e5cf71cb (""mm/munlock: delete munlock_vma_pages_all(), allow oomreap"") [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization] Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: ""Kirill A . Shutemov"" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bf3980c85212fc71512d27a46f5aab66f46ca284;"The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details)";no;no;yes 2;C_kwDOACN7MtoAKGJmMzk4MGM4NTIxMmZjNzE1MTJkMjdhNDZmNWFhYjY2ZjQ2Y2EyODQ;Suren Baghdasaryan;Andrew Morton;"mm: drop oom code from exit_mmap The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details). The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2]). Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped. The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed. Remove the oom_reaper from exit_mmap which will make the code easier to read. This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true. [1] 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") [2] 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") [3] a213e5cf71cb (""mm/munlock: delete munlock_vma_pages_all(), allow oomreap"") [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization] Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: ""Kirill A . Shutemov"" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bf3980c85212fc71512d27a46f5aab66f46ca284;" The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2])";no;no;yes 2;C_kwDOACN7MtoAKGJmMzk4MGM4NTIxMmZjNzE1MTJkMjdhNDZmNWFhYjY2ZjQ2Y2EyODQ;Suren Baghdasaryan;Andrew Morton;"mm: drop oom code from exit_mmap The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details). The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2]). Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped. The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed. Remove the oom_reaper from exit_mmap which will make the code easier to read. This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true. [1] 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") [2] 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") [3] a213e5cf71cb (""mm/munlock: delete munlock_vma_pages_all(), allow oomreap"") [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization] Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: ""Kirill A . Shutemov"" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bf3980c85212fc71512d27a46f5aab66f46ca284;"Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped";yes;yes;yes 2;C_kwDOACN7MtoAKGJmMzk4MGM4NTIxMmZjNzE1MTJkMjdhNDZmNWFhYjY2ZjQ2Y2EyODQ;Suren Baghdasaryan;Andrew Morton;"mm: drop oom code from exit_mmap The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details). The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2]). Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped. The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed. Remove the oom_reaper from exit_mmap which will make the code easier to read. This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true. [1] 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") [2] 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") [3] a213e5cf71cb (""mm/munlock: delete munlock_vma_pages_all(), allow oomreap"") [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization] Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: ""Kirill A . Shutemov"" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bf3980c85212fc71512d27a46f5aab66f46ca284;" The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed";yes;no;yes 2;C_kwDOACN7MtoAKGJmMzk4MGM4NTIxMmZjNzE1MTJkMjdhNDZmNWFhYjY2ZjQ2Y2EyODQ;Suren Baghdasaryan;Andrew Morton;"mm: drop oom code from exit_mmap The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details). The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2]). Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped. The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed. Remove the oom_reaper from exit_mmap which will make the code easier to read. This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true. [1] 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") [2] 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") [3] a213e5cf71cb (""mm/munlock: delete munlock_vma_pages_all(), allow oomreap"") [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization] Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: ""Kirill A . Shutemov"" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bf3980c85212fc71512d27a46f5aab66f46ca284;"Remove the oom_reaper from exit_mmap which will make the code easier to read";yes;yes;no 2;C_kwDOACN7MtoAKGJmMzk4MGM4NTIxMmZjNzE1MTJkMjdhNDZmNWFhYjY2ZjQ2Y2EyODQ;Suren Baghdasaryan;Andrew Morton;"mm: drop oom code from exit_mmap The primary reason to invoke the oom reaper from the exit_mmap path used to be a prevention of an excessive oom killing if the oom victim exit races with the oom reaper (see [1] for more details). The invocation has moved around since then because of the interaction with the munlock logic but the underlying reason has remained the same (see [2]). Munlock code is no longer a problem since [3] and there shouldn't be any blocking operation before the memory is unmapped by exit_mmap so the oom reaper invocation can be dropped. The unmapping part can be done with the non-exclusive mmap_sem and the exclusive one is only required when page tables are freed. Remove the oom_reaper from exit_mmap which will make the code easier to read. This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true. [1] 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") [2] 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") [3] a213e5cf71cb (""mm/munlock: delete munlock_vma_pages_all(), allow oomreap"") [akpm@linux-foundation.org: restore Suren's mmap_read_lock() optimization] Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: ""Kirill A . Shutemov"" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bf3980c85212fc71512d27a46f5aab66f46ca284;" This is really unlikely to make any observable difference although some microbenchmarks could benefit from one less branch that needs to be evaluated even though it almost never is true.";yes;yes;no 4;C_kwDOACN7MtoAKGExOWNhZDA2OTE1OTdlYjc5YzEyM2I4YTE5YTlmYWJhNWFiN2Q5MGU;Andrew Morton;akpm;"mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery arm allnoconfig: mm/oom_kill.c:60:25: warning: 'vm_oom_kill_table' defined but not used [-Wunused-variable] 60 | static struct ctl_table vm_oom_kill_table[] = { | ^~~~~~~~~~~~~~~~~ Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a19cad0691597eb79c123b8a19a9faba5ab7d90e;mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery;yes;yes;no 8;C_kwDOACN7MtoAKGEyMTNlNWNmNzFjYmNlYTRiMjNjYWVkY2I4ZmU2NjI5YTMzM2IyNzU;Hugh Dickins;Matthew Wilcox (Oracle);"mm/munlock: delete munlock_vma_pages_all(), allow oomreap munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages. There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range(). And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>";https://api.github.com/repos/torvalds/linux/git/commits/a213e5cf71cbcea4b23caedcb8fe6629a333b275;mm/munlock: delete munlock_vma_pages_all(), allow oomreap;yes;no;no 8;C_kwDOACN7MtoAKGEyMTNlNWNmNzFjYmNlYTRiMjNjYWVkY2I4ZmU2NjI5YTMzM2IyNzU;Hugh Dickins;Matthew Wilcox (Oracle);"mm/munlock: delete munlock_vma_pages_all(), allow oomreap munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages. There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range(). And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>";https://api.github.com/repos/torvalds/linux/git/commits/a213e5cf71cbcea4b23caedcb8fe6629a333b275;"munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte";yes;no;no 8;C_kwDOACN7MtoAKGEyMTNlNWNmNzFjYmNlYTRiMjNjYWVkY2I4ZmU2NjI5YTMzM2IyNzU;Hugh Dickins;Matthew Wilcox (Oracle);"mm/munlock: delete munlock_vma_pages_all(), allow oomreap munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages. There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range(). And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>";https://api.github.com/repos/torvalds/linux/git/commits/a213e5cf71cbcea4b23caedcb8fe6629a333b275;" Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages";yes;yes;no 8;C_kwDOACN7MtoAKGEyMTNlNWNmNzFjYmNlYTRiMjNjYWVkY2I4ZmU2NjI5YTMzM2IyNzU;Hugh Dickins;Matthew Wilcox (Oracle);"mm/munlock: delete munlock_vma_pages_all(), allow oomreap munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages. There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range(). And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>";https://api.github.com/repos/torvalds/linux/git/commits/a213e5cf71cbcea4b23caedcb8fe6629a333b275;"There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped()";yes;yes;yes 8;C_kwDOACN7MtoAKGEyMTNlNWNmNzFjYmNlYTRiMjNjYWVkY2I4ZmU2NjI5YTMzM2IyNzU;Hugh Dickins;Matthew Wilcox (Oracle);"mm/munlock: delete munlock_vma_pages_all(), allow oomreap munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages. There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range(). And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>";https://api.github.com/repos/torvalds/linux/git/commits/a213e5cf71cbcea4b23caedcb8fe6629a333b275;exit_mmap() does not need locked_vm updates, so delete unlock_range();yes;yes;yes 8;C_kwDOACN7MtoAKGEyMTNlNWNmNzFjYmNlYTRiMjNjYWVkY2I4ZmU2NjI5YTMzM2IyNzU;Hugh Dickins;Matthew Wilcox (Oracle);"mm/munlock: delete munlock_vma_pages_all(), allow oomreap munlock_vma_pages_range() will still be required, when munlocking but not munmapping a set of pages; but when unmapping a pte, the mlock count will be maintained in much the same way as it will be maintained when mapping in the pte. Which removes the need for munlock_vma_pages_all() on mlocked vmas when munmapping or exiting: eliminating the catastrophic contention on i_mmap_rwsem, and the need for page lock on the pages. There is still a need to update locked_vm accounting according to the munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped(). exit_mmap() does not need locked_vm updates, so delete unlock_range(). And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>";https://api.github.com/repos/torvalds/linux/git/commits/a213e5cf71cbcea4b23caedcb8fe6629a333b275;"And wasn't I the one who forbade the OOM reaper to attack mlocked vmas, because of the uncertainty in blocking on all those page locks? No fear of that now, so permit the OOM reaper on mlocked vmas.";yes;yes;no 10;C_kwDOACN7MtoAKGJhNTM1YzFjYWYzZWU3OGFhNzcxOWU5ZTRiMDdhMGRjMWQxNTNiOWU;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill: allow process_mrelease to run under mmap_lock protection With exit_mmap holding mmap_write_lock during free_pgtables call, process_mrelease does not need to elevate mm->mm_users in order to prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is walking the VMA tree. The change prevents process_mrelease from calling the last mmput, which can lead to waiting for IO completion in exit_aio. Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ba535c1caf3ee78aa7719e9e4b07a0dc1d153b9e;mm/oom_kill: allow process_mrelease to run under mmap_lock protection;yes;no;no 10;C_kwDOACN7MtoAKGJhNTM1YzFjYWYzZWU3OGFhNzcxOWU5ZTRiMDdhMGRjMWQxNTNiOWU;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill: allow process_mrelease to run under mmap_lock protection With exit_mmap holding mmap_write_lock during free_pgtables call, process_mrelease does not need to elevate mm->mm_users in order to prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is walking the VMA tree. The change prevents process_mrelease from calling the last mmput, which can lead to waiting for IO completion in exit_aio. Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ba535c1caf3ee78aa7719e9e4b07a0dc1d153b9e;"With exit_mmap holding mmap_write_lock during free_pgtables call, process_mrelease does not need to elevate mm->mm_users in order to prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is walking the VMA tree";no;yes;yes 10;C_kwDOACN7MtoAKGJhNTM1YzFjYWYzZWU3OGFhNzcxOWU5ZTRiMDdhMGRjMWQxNTNiOWU;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill: allow process_mrelease to run under mmap_lock protection With exit_mmap holding mmap_write_lock during free_pgtables call, process_mrelease does not need to elevate mm->mm_users in order to prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is walking the VMA tree. The change prevents process_mrelease from calling the last mmput, which can lead to waiting for IO completion in exit_aio. Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ba535c1caf3ee78aa7719e9e4b07a0dc1d153b9e;" The change prevents process_mrelease from calling the last mmput, which can lead to waiting for IO completion in exit_aio.";yes;yes;no 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;mm/memcg: add oom_group_kill memory event;yes;no;no 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;"Our container agent wants to know when a container exits if it was OOM killed or not to report to the user";yes;yes;yes 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;" We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything";no;no;yes 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;Existing memory.events are insufficient for knowing if this triggered;no;yes;yes 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;"1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero";yes;no;yes 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;"This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container";yes;yes;yes 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;"2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup";no;yes;yes 11;C_kwDOACN7MtoAKGI2YmY5YWJiMGFhNDRlNTNmZmU5YzFlNmUxZDMyNTY4ZjViMjVlNGE;Dan Schatzberg;Linus Torvalds;"mm/memcg: add oom_group_kill memory event Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6bf9abb0aa44e53ffe9c1e6e1d32568f5b25e4a;"This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed.";yes;yes;no 12;C_kwDOACN7MtoAKDk4YjI0YjE2YjJhZWJmZmFiZjViODY3MGY0NGYxOTY2NmMxZTAyOWY;Eric W. Biederman;Eric W. Biederman;"signal: Have the oom killer detect coredumps using signal->core_state In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change __task_will_free_mem to test signal->core_state instead of the flag SIGNAL_GROUP_COREDUMP. Both fields are protected by siglock and both live in signal_struct so there are no real tradeoffs here, just a change to which field is being tested. Link: https://lkml.kernel.org/r/20211213225350.27481-3-ebiederm@xmission.com Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/98b24b16b2aebffabf5b8670f44f19666c1e029f;signal: Have the oom killer detect coredumps using signal->core_state;yes;no;no 12;C_kwDOACN7MtoAKDk4YjI0YjE2YjJhZWJmZmFiZjViODY3MGY0NGYxOTY2NmMxZTAyOWY;Eric W. Biederman;Eric W. Biederman;"signal: Have the oom killer detect coredumps using signal->core_state In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change __task_will_free_mem to test signal->core_state instead of the flag SIGNAL_GROUP_COREDUMP. Both fields are protected by siglock and both live in signal_struct so there are no real tradeoffs here, just a change to which field is being tested. Link: https://lkml.kernel.org/r/20211213225350.27481-3-ebiederm@xmission.com Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/98b24b16b2aebffabf5b8670f44f19666c1e029f;"In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change __task_will_free_mem to test signal->core_state instead of the flag SIGNAL_GROUP_COREDUMP";yes;yes;no 12;C_kwDOACN7MtoAKDk4YjI0YjE2YjJhZWJmZmFiZjViODY3MGY0NGYxOTY2NmMxZTAyOWY;Eric W. Biederman;Eric W. Biederman;"signal: Have the oom killer detect coredumps using signal->core_state In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change __task_will_free_mem to test signal->core_state instead of the flag SIGNAL_GROUP_COREDUMP. Both fields are protected by siglock and both live in signal_struct so there are no real tradeoffs here, just a change to which field is being tested. Link: https://lkml.kernel.org/r/20211213225350.27481-3-ebiederm@xmission.com Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/98b24b16b2aebffabf5b8670f44f19666c1e029f;"Both fields are protected by siglock and both live in signal_struct so there are no real tradeoffs here, just a change to which field is being tested.";yes;no;yes 13;C_kwDOACN7MtoAKDM3MjM5MjllYjBmNTBlMjEwMWRlNzM5Y2RiNjY0NThhNGYxZjRiMjc;Sultan Alsawaf;Linus Torvalds;"mm: mark the OOM reaper thread as freezable The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes. However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen. Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer. Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice. Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3723929eb0f50e2101de739cdb66458a4f1f4b27;mm: mark the OOM reaper thread as freezable;yes;no;no 13;C_kwDOACN7MtoAKDM3MjM5MjllYjBmNTBlMjEwMWRlNzM5Y2RiNjY0NThhNGYxZjRiMjc;Sultan Alsawaf;Linus Torvalds;"mm: mark the OOM reaper thread as freezable The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes. However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen. Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer. Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice. Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3723929eb0f50e2101de739cdb66458a4f1f4b27;"The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state";no;no;yes 13;C_kwDOACN7MtoAKDM3MjM5MjllYjBmNTBlMjEwMWRlNzM5Y2RiNjY0NThhNGYxZjRiMjc;Sultan Alsawaf;Linus Torvalds;"mm: mark the OOM reaper thread as freezable The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes. However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen. Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer. Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice. Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3723929eb0f50e2101de739cdb66458a4f1f4b27;" To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes";no;no;yes 13;C_kwDOACN7MtoAKDM3MjM5MjllYjBmNTBlMjEwMWRlNzM5Y2RiNjY0NThhNGYxZjRiMjc;Sultan Alsawaf;Linus Torvalds;"mm: mark the OOM reaper thread as freezable The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes. However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen. Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer. Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice. Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3723929eb0f50e2101de739cdb66458a4f1f4b27;"However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations";no;yes;yes 13;C_kwDOACN7MtoAKDM3MjM5MjllYjBmNTBlMjEwMWRlNzM5Y2RiNjY0NThhNGYxZjRiMjc;Sultan Alsawaf;Linus Torvalds;"mm: mark the OOM reaper thread as freezable The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes. However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen. Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer. Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice. Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3723929eb0f50e2101de739cdb66458a4f1f4b27;" This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen";no;yes;yes 13;C_kwDOACN7MtoAKDM3MjM5MjllYjBmNTBlMjEwMWRlNzM5Y2RiNjY0NThhNGYxZjRiMjc;Sultan Alsawaf;Linus Torvalds;"mm: mark the OOM reaper thread as freezable The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes. However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen. Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer. Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice. Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3723929eb0f50e2101de739cdb66458a4f1f4b27;"Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer";yes;yes;no 13;C_kwDOACN7MtoAKDM3MjM5MjllYjBmNTBlMjEwMWRlNzM5Y2RiNjY0NThhNGYxZjRiMjc;Sultan Alsawaf;Linus Torvalds;"mm: mark the OOM reaper thread as freezable The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes. However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen. Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer. Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice. Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3723929eb0f50e2101de739cdb66458a4f1f4b27;"Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice.";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;mm, oom: do not trigger out_of_memory from the #PF;yes;no;no 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;"Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" This can happen for 2 different reasons";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;"The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations";no;yes;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path";yes;yes;no 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate";no;yes;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;"Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;"This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it";yes;yes;no 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF";yes;yes;no 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;[VvS: #PF allocation can hit into limit of cgroup v1 kmem controller;no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;"This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster";no;yes;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" This has been broken since kmem accounting has been introduced for cgroup v1 (3.8)";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected";no;no;yes 14;C_kwDOACN7MtoAKDYwZTI3OTNkNDQwYTNlYzk1YWJiNWQ2ZDRmYzAzNGE0YjQ4MDQ3MmQ;Michal Hocko;Linus Torvalds;"mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/60e2793d440a3ec95abb5d6d4fc034a4b480472d;" This patch fixes the problem and should be backported into stable/LTS.]";yes;yes;no 16;C_kwDOACN7MtoAKDMzNzU0NmU4M2ZjN2U1MDkxN2Y0NDg0NmJlZWU5MzZhYmI5YzlmMWY;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 (""mm: introduce process_mrelease system call"") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/337546e83fc7e50917f44846beee936abb9c9f1f;mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap;yes;yes;no 16;C_kwDOACN7MtoAKDMzNzU0NmU4M2ZjN2U1MDkxN2Y0NDg0NmJlZWU5MzZhYmI5YzlmMWY;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 (""mm: introduce process_mrelease system call"") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/337546e83fc7e50917f44846beee936abb9c9f1f;"Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call";no;yes;yes 16;C_kwDOACN7MtoAKDMzNzU0NmU4M2ZjN2U1MDkxN2Y0NDg0NmJlZWU5MzZhYmI5YzlmMWY;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 (""mm: introduce process_mrelease system call"") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/337546e83fc7e50917f44846beee936abb9c9f1f;" oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock";no;no;yes 16;C_kwDOACN7MtoAKDMzNzU0NmU4M2ZjN2U1MDkxN2Y0NDg0NmJlZWU5MzZhYmI5YzlmMWY;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 (""mm: introduce process_mrelease system call"") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/337546e83fc7e50917f44846beee936abb9c9f1f;"Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack";yes;no;yes 16;C_kwDOACN7MtoAKDMzNzU0NmU4M2ZjN2U1MDkxN2Y0NDg0NmJlZWU5MzZhYmI5YzlmMWY;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 (""mm: introduce process_mrelease system call"") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/337546e83fc7e50917f44846beee936abb9c9f1f;" Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished";yes;yes;no 16;C_kwDOACN7MtoAKDMzNzU0NmU4M2ZjN2U1MDkxN2Y0NDg0NmJlZWU5MzZhYmI5YzlmMWY;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 (""mm: introduce process_mrelease system call"") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/337546e83fc7e50917f44846beee936abb9c9f1f;" Patch slightly refactors the code to adapt for a possible mmget_not_zero failure";yes;yes;no 16;C_kwDOACN7MtoAKDMzNzU0NmU4M2ZjN2U1MDkxN2Y0NDg0NmJlZWU5MzZhYmI5YzlmMWY;Suren Baghdasaryan;Linus Torvalds;"mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 (""mm: introduce process_mrelease system call"") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/337546e83fc7e50917f44846beee936abb9c9f1f;"This fix has considerable negative impact on process_mrelease performance and will likely need later optimization.";yes;yes;no 17;C_kwDOACN7MtoAKGVlOTk1NWQ2MWEwYTc3MDE1MmY5YzNhZjQ3MGJkMTY4OWYwMzRjNzQ;Christian Brauner;Christian Brauner;"mm: use pidfd_get_task() Instead of duplicating the same code in two places use the newly added pidfd_get_task() helper. This fixes an (unimportant for now) bug where PIDTYPE_PID is used whereas PIDTYPE_TGID should have been used. Link: https://lore.kernel.org/r/20211004125050.1153693-3-christian.brauner@ubuntu.com Link: https://lore.kernel.org/r/20211011133245.1703103-3-brauner@kernel.org Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Bobrowski <repnop@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Reviewed-by: Matthew Bobrowski <repnop@google.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>";https://api.github.com/repos/torvalds/linux/git/commits/ee9955d61a0a770152f9c3af470bd1689f034c74;mm: use pidfd_get_task();yes;no;no 17;C_kwDOACN7MtoAKGVlOTk1NWQ2MWEwYTc3MDE1MmY5YzNhZjQ3MGJkMTY4OWYwMzRjNzQ;Christian Brauner;Christian Brauner;"mm: use pidfd_get_task() Instead of duplicating the same code in two places use the newly added pidfd_get_task() helper. This fixes an (unimportant for now) bug where PIDTYPE_PID is used whereas PIDTYPE_TGID should have been used. Link: https://lore.kernel.org/r/20211004125050.1153693-3-christian.brauner@ubuntu.com Link: https://lore.kernel.org/r/20211011133245.1703103-3-brauner@kernel.org Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Bobrowski <repnop@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Reviewed-by: Matthew Bobrowski <repnop@google.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>";https://api.github.com/repos/torvalds/linux/git/commits/ee9955d61a0a770152f9c3af470bd1689f034c74;"Instead of duplicating the same code in two places use the newly added pidfd_get_task() helper";yes;yes;no 17;C_kwDOACN7MtoAKGVlOTk1NWQ2MWEwYTc3MDE1MmY5YzNhZjQ3MGJkMTY4OWYwMzRjNzQ;Christian Brauner;Christian Brauner;"mm: use pidfd_get_task() Instead of duplicating the same code in two places use the newly added pidfd_get_task() helper. This fixes an (unimportant for now) bug where PIDTYPE_PID is used whereas PIDTYPE_TGID should have been used. Link: https://lore.kernel.org/r/20211004125050.1153693-3-christian.brauner@ubuntu.com Link: https://lore.kernel.org/r/20211011133245.1703103-3-brauner@kernel.org Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Bobrowski <repnop@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Reviewed-by: Matthew Bobrowski <repnop@google.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>";https://api.github.com/repos/torvalds/linux/git/commits/ee9955d61a0a770152f9c3af470bd1689f034c74;"This fixes an (unimportant for now) bug where PIDTYPE_PID is used whereas PIDTYPE_TGID should have been used.";yes;yes;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;coredump: Don't perform any cleanups before dumping core;yes;no;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;"Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens";yes;no;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;" This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes";yes;yes;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;" This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete";yes;yes;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;"Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process";yes;yes;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;"Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release";yes;no;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;"Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace""";yes;yes;no 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;"The other tests in may_ptrace_stop all concern avoiding stopping during a coredump";no;no;yes 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;" These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump";yes;yes;yes 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;" The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true";no;no;yes 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;"Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit";no;no;yes 18;C_kwDOACN7MtoAKDkyMzA3MzgzMDgyZGFmZjVkZjg4NGEyNWRmOWUyODNlZmI3ZWYyNjE;Eric W. Biederman;Eric W. Biederman;"coredump: Don't perform any cleanups before dumping core Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace ""may_ptrace_stop()"" with a simple test of ""current->ptrace"". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change ""ptrace_event(PTRACE_EVENT_EXIT)"" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/92307383082daff5df884a25df9e283efb7ef261;" This is no longer an issue as ""ptrace_event(PTRACE_EVENT_EXIT)"" is no longer reached until after the coredump completes.";yes;yes;no 19;C_kwDOACN7MtoAKGQ2N2UwM2UzNjE2MTliMjBjNTFhYWVmM2I3ZGQxNDk3NjE3YzM3MWQ;Eric W. Biederman;Eric W. Biederman;"exit: Factor coredump_exit_mm out of exit_mm Separate the coredump logic from the ordinary exit_mm logic by moving the coredump logic out of exit_mm into it's own function coredump_exit_mm. Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/d67e03e361619b20c51aaef3b7dd1497617c371d;exit: Factor coredump_exit_mm out of exit_mm;yes;yes;no 19;C_kwDOACN7MtoAKGQ2N2UwM2UzNjE2MTliMjBjNTFhYWVmM2I3ZGQxNDk3NjE3YzM3MWQ;Eric W. Biederman;Eric W. Biederman;"exit: Factor coredump_exit_mm out of exit_mm Separate the coredump logic from the ordinary exit_mm logic by moving the coredump logic out of exit_mm into it's own function coredump_exit_mm. Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: ""Eric W. Biederman"" <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/d67e03e361619b20c51aaef3b7dd1497617c371d;"Separate the coredump logic from the ordinary exit_mm logic by moving the coredump logic out of exit_mm into it's own function coredump_exit_mm.";yes;yes;no 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;mm: introduce process_mrelease system call;yes;no;no 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;"In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control";no;no;yes 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;" One way to accomplish that is to kill non-essential processes to free up memory for more important ones";no;no;yes 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;"Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd";no;no;yes 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;"For such system component it's important to be able to free memory quickly and efficiently";no;no;yes 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;" Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running";no;no;yes 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;" A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure";no;yes;yes 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;"Introduce process_mrelease system call that releases memory of a dying process from the context of the caller";yes;no;no 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;" This way the memory is freed in a more controllable way with CPU affinity and priority of the caller";yes;yes;no 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;" The workload of freeing the memory will also be charged to the caller";yes;no;no 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;" The operation is allowed only on a dying process";yes;no;no 20;MDY6Q29tbWl0MjMyNTI5ODo4ODRhN2U1OTY0ZTA2ZWQ5M2M3NzcxYzBkN2NmMTljMDlhODk0NmYx;Suren Baghdasaryan;Linus Torvalds;"mm: introduce process_mrelease system call In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/884a7e5964e06ed93c7771c0d7cf19c09a8946f1;"After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case";yes;yes;no 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;mm/mempolicy: cleanup nodemask intersection check for oom;yes;yes;no 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;"Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them";no;yes;yes 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;" Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly";no;yes;no 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;" This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy()";yes;yes;no 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;"mempolicy_nodemask_intersects seem to be a general purpose mempolicy function";no;no;yes 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;" In fact it is partially tailored for the OOM purpose instead";no;no;yes 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;" The oom proper is the only existing user so rename the function to make that purpose explicit";yes;yes;yes 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;"While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM";yes;yes;no 21;MDY6Q29tbWl0MjMyNTI5ODpiMjZlNTE3YTA1OGJkNDBjNzkwYTFkOTg2OGM4OTY4NDJmMmU0MTU1;Feng Tang;Linus Torvalds;"mm/mempolicy: cleanup nodemask intersection check for oom Patch series ""mm/mempolicy: some fix and semantics cleanup"", v4. Current memory policy code has some confusing and ambiguous part about MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there are many places having to distinguish them. Also the nodemask intersection check needs cleanup to be more explicit for OOM use, and handle MPOL_INTERLEAVE correctly. This patchset cleans up these and unifies the parameter sanity check for mbind() and set_mempolicy(). This patch (of 3): mempolicy_nodemask_intersects seem to be a general purpose mempolicy function. In fact it is partially tailored for the OOM purpose instead. The oom proper is the only existing user so rename the function to make that purpose explicit. While at it drop the MPOL_INTERLEAVE as those allocations never has a nodemask defined (see alloc_page_interleave) so this is a dead code and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard requirement so it shouldn't be considered during the OOM. The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b26e517a058bd40c790a1d9868c896842f2e4155;"The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic.";yes;yes;no 22;MDY6Q29tbWl0MjMyNTI5ODo0YzljMzgwOWFlMmVjZmNlY2U5YWNiM2Y1MTQyN2U2MTdkMjFmYWZi;Rolf Eike Beer;Paul E. McKenney;"rcu: Fix typo in comment: kthead -> kthread Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/4c9c3809ae2ecfcece9acb3f51427e617d21fafb;rcu: Fix typo in comment: kthead -> kthread;yes;yes;no 23;MDY6Q29tbWl0MjMyNTI5ODpmMDk1M2ExYmJhY2E3MWUxZWJiY2I5ODY0ZWIxYjI3MzE1NjE1N2Vk;Ingo Molnar;Linus Torvalds;"mm: fix typos in comments Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0953a1bbaca71e1ebbcb9864eb1b273156157ed;mm: fix typos in comments;yes;yes;no 23;MDY6Q29tbWl0MjMyNTI5ODpmMDk1M2ExYmJhY2E3MWUxZWJiY2I5ODY0ZWIxYjI3MzE1NjE1N2Vk;Ingo Molnar;Linus Torvalds;"mm: fix typos in comments Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0953a1bbaca71e1ebbcb9864eb1b273156157ed;"Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes.";yes;yes;no 24;MDY6Q29tbWl0MjMyNTI5ODo2OGQ2OGZmNmViYmY2OWQwMjUxMWRkNDhmMTZiMzc5NTY3MWM5YjBi;Zhiyuan Dai;Linus Torvalds;"mm/mempool: minor coding style tweaks Various coding style tweaks to various files under mm/ [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/68d68ff6ebbf69d02511dd48f16b3795671c9b0b;mm/mempool: minor coding style tweaks;yes;yes;no 24;MDY6Q29tbWl0MjMyNTI5ODo2OGQ2OGZmNmViYmY2OWQwMjUxMWRkNDhmMTZiMzc5NTY3MWM5YjBi;Zhiyuan Dai;Linus Torvalds;"mm/mempool: minor coding style tweaks Various coding style tweaks to various files under mm/ [daizhiyuan@phytium.com.cn: mm/swapfile: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614223624-16055-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/sparse: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227288-19363-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmscan: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614227649-19853-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/compaction: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228218-20770-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/oom_kill: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228360-21168-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/shmem: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228504-21491-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/page_alloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228613-21754-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/filemap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1614228936-22337-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mlock: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613956588-2453-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/frontswap: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613962668-15045-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/vmalloc: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613963379-15988-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/memory_hotplug: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613971784-24878-1-git-send-email-daizhiyuan@phytium.com.cn [daizhiyuan@phytium.com.cn: mm/mempolicy: minor coding style tweaks] Link: https://lkml.kernel.org/r/1613972228-25501-1-git-send-email-daizhiyuan@phytium.com.cn Link: https://lkml.kernel.org/r/1614222374-13805-1-git-send-email-daizhiyuan@phytium.com.cn Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/68d68ff6ebbf69d02511dd48f16b3795671c9b0b;Various coding style tweaks to various files under mm/;yes;yes;no 25;MDY6Q29tbWl0MjMyNTI5ODo4NDViZTFjZDM0NDY0NjIwODYxYjQ1N2I4MDhlNWZiMjExNWQwNmIw;Randy Dunlap;Linus Torvalds;"mm: eliminate ""expecting prototype"" kernel-doc warnings Fix stray kernel-doc warnings in mm/ due to mis-typed or missing function names. Quietens these kernel-doc warnings: mm/mmu_gather.c:264: warning: expecting prototype for tlb_gather_mmu(). Prototype was for __tlb_gather_mmu() instead mm/oom_kill.c:180: warning: expecting prototype for Check whether unreclaimable slab amount is greater than(). Prototype was for should_dump_unreclaim_slab() instead mm/shuffle.c:155: warning: expecting prototype for shuffle_free_memory(). Prototype was for __shuffle_free_memory() instead Link: https://lkml.kernel.org/r/20210411210642.11362-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/845be1cd34464620861b457b808e5fb2115d06b0;"mm: eliminate ""expecting prototype"" kernel-doc warnings";yes;no;no 25;MDY6Q29tbWl0MjMyNTI5ODo4NDViZTFjZDM0NDY0NjIwODYxYjQ1N2I4MDhlNWZiMjExNWQwNmIw;Randy Dunlap;Linus Torvalds;"mm: eliminate ""expecting prototype"" kernel-doc warnings Fix stray kernel-doc warnings in mm/ due to mis-typed or missing function names. Quietens these kernel-doc warnings: mm/mmu_gather.c:264: warning: expecting prototype for tlb_gather_mmu(). Prototype was for __tlb_gather_mmu() instead mm/oom_kill.c:180: warning: expecting prototype for Check whether unreclaimable slab amount is greater than(). Prototype was for should_dump_unreclaim_slab() instead mm/shuffle.c:155: warning: expecting prototype for shuffle_free_memory(). Prototype was for __shuffle_free_memory() instead Link: https://lkml.kernel.org/r/20210411210642.11362-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/845be1cd34464620861b457b808e5fb2115d06b0;"Fix stray kernel-doc warnings in mm/ due to mis-typed or missing function names";yes;yes;no 26;MDY6Q29tbWl0MjMyNTI5ODpmODE1OWMxMzkwNWJiYTI2ZjNlMTc4MmE1MjFkYWNmN2E2NmZjMWNl;Tang Yizhou;Linus Torvalds;"mm, oom: fix a comment in dump_task() If p is a kthread, it will be checked in oom_unkillable_task() so we can delete the corresponding comment. Link: https://lkml.kernel.org/r/20210125133006.7242-1-tangyizhou@huawei.com Signed-off-by: Tang Yizhou <tangyizhou@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f8159c13905bba26f3e1782a521dacf7a66fc1ce;mm, oom: fix a comment in dump_task();yes;yes;no 26;MDY6Q29tbWl0MjMyNTI5ODpmODE1OWMxMzkwNWJiYTI2ZjNlMTc4MmE1MjFkYWNmN2E2NmZjMWNl;Tang Yizhou;Linus Torvalds;"mm, oom: fix a comment in dump_task() If p is a kthread, it will be checked in oom_unkillable_task() so we can delete the corresponding comment. Link: https://lkml.kernel.org/r/20210125133006.7242-1-tangyizhou@huawei.com Signed-off-by: Tang Yizhou <tangyizhou@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f8159c13905bba26f3e1782a521dacf7a66fc1ce;"If p is a kthread, it will be checked in oom_unkillable_task() so we can delete the corresponding comment.";yes;yes;no 27;MDY6Q29tbWl0MjMyNTI5ODphNzJhZmQ4NzMwODljNjk3MDUzZTlkYWE4NWZmMzQzYjMxNDBkMmU3;Will Deacon;Peter Zijlstra;"tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com";https://api.github.com/repos/torvalds/linux/git/commits/a72afd873089c697053e9daa85ff343b3140d2e7;tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu();yes;no;no 27;MDY6Q29tbWl0MjMyNTI5ODphNzJhZmQ4NzMwODljNjk3MDUzZTlkYWE4NWZmMzQzYjMxNDBkMmU3;Will Deacon;Peter Zijlstra;"tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com";https://api.github.com/repos/torvalds/linux/git/commits/a72afd873089c697053e9daa85ff343b3140d2e7;"The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing";no;yes;yes 27;MDY6Q29tbWl0MjMyNTI5ODphNzJhZmQ4NzMwODljNjk3MDUzZTlkYWE4NWZmMzQzYjMxNDBkMmU3;Will Deacon;Peter Zijlstra;"tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com";https://api.github.com/repos/torvalds/linux/git/commits/a72afd873089c697053e9daa85ff343b3140d2e7;Remove the unused arguments and update all callers.;yes;yes;no 28;MDY6Q29tbWl0MjMyNTI5ODphZThlYmE4YjVkNzIzYTRjYTU0MzAyNGI2ZTUxZjRkMGY0ZmI2YjZi;Will Deacon;Peter Zijlstra;"tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Since commit 7a30df49f63a (""mm: mmu_gather: remove __tlb_reset_range() for force flush""), the 'start' and 'end' arguments to tlb_finish_mmu() are no longer used, since we flush the whole mm in case of a nested invalidation. Remove the unused arguments and update all callers. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org";https://api.github.com/repos/torvalds/linux/git/commits/ae8eba8b5d723a4ca543024b6e51f4d0f4fb6b6b;tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu();yes;yes;no 28;MDY6Q29tbWl0MjMyNTI5ODphZThlYmE4YjVkNzIzYTRjYTU0MzAyNGI2ZTUxZjRkMGY0ZmI2YjZi;Will Deacon;Peter Zijlstra;"tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Since commit 7a30df49f63a (""mm: mmu_gather: remove __tlb_reset_range() for force flush""), the 'start' and 'end' arguments to tlb_finish_mmu() are no longer used, since we flush the whole mm in case of a nested invalidation. Remove the unused arguments and update all callers. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org";https://api.github.com/repos/torvalds/linux/git/commits/ae8eba8b5d723a4ca543024b6e51f4d0f4fb6b6b;"Since commit 7a30df49f63a (""mm: mmu_gather: remove __tlb_reset_range() are no longer used, since we flush the whole mm in case of a nested invalidation";no;yes;yes 28;MDY6Q29tbWl0MjMyNTI5ODphZThlYmE4YjVkNzIzYTRjYTU0MzAyNGI2ZTUxZjRkMGY0ZmI2YjZi;Will Deacon;Peter Zijlstra;"tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Since commit 7a30df49f63a (""mm: mmu_gather: remove __tlb_reset_range() for force flush""), the 'start' and 'end' arguments to tlb_finish_mmu() are no longer used, since we flush the whole mm in case of a nested invalidation. Remove the unused arguments and update all callers. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org";https://api.github.com/repos/torvalds/linux/git/commits/ae8eba8b5d723a4ca543024b6e51f4d0f4fb6b6b;Remove the unused arguments and update all callers.;yes;yes;no 31;MDY6Q29tbWl0MjMyNTI5ODo2MTliNWI0NjliY2FiODRlYTNiZWUxZDhkMDQ0NTFjNzgxZDIzZmVi;Yafang Shao;Linus Torvalds;"mm, oom: show process exiting information in __oom_kill_process() When the OOM killer finds a victim and tryies to kill it, if the victim is already exiting, the task mm will be NULL and no process will be killed. But the dump_header() has been already executed, so it will be strange to dump so much information without killing a process. We'd better show some helpful information to indicate why this happens. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/619b5b469bcab84ea3bee1d8d04451c781d23feb;mm, oom: show process exiting information in __oom_kill_process();yes;no;no 31;MDY6Q29tbWl0MjMyNTI5ODo2MTliNWI0NjliY2FiODRlYTNiZWUxZDhkMDQ0NTFjNzgxZDIzZmVi;Yafang Shao;Linus Torvalds;"mm, oom: show process exiting information in __oom_kill_process() When the OOM killer finds a victim and tryies to kill it, if the victim is already exiting, the task mm will be NULL and no process will be killed. But the dump_header() has been already executed, so it will be strange to dump so much information without killing a process. We'd better show some helpful information to indicate why this happens. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/619b5b469bcab84ea3bee1d8d04451c781d23feb;"When the OOM killer finds a victim and tryies to kill it, if the victim is already exiting, the task mm will be NULL and no process will be killed";no;no;yes 31;MDY6Q29tbWl0MjMyNTI5ODo2MTliNWI0NjliY2FiODRlYTNiZWUxZDhkMDQ0NTFjNzgxZDIzZmVi;Yafang Shao;Linus Torvalds;"mm, oom: show process exiting information in __oom_kill_process() When the OOM killer finds a victim and tryies to kill it, if the victim is already exiting, the task mm will be NULL and no process will be killed. But the dump_header() has been already executed, so it will be strange to dump so much information without killing a process. We'd better show some helpful information to indicate why this happens. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/619b5b469bcab84ea3bee1d8d04451c781d23feb;"But the dump_header() has been already executed, so it will be strange to dump so much information without killing a process";no;yes;yes 31;MDY6Q29tbWl0MjMyNTI5ODo2MTliNWI0NjliY2FiODRlYTNiZWUxZDhkMDQ0NTFjNzgxZDIzZmVi;Yafang Shao;Linus Torvalds;"mm, oom: show process exiting information in __oom_kill_process() When the OOM killer finds a victim and tryies to kill it, if the victim is already exiting, the task mm will be NULL and no process will be killed. But the dump_header() has been already executed, so it will be strange to dump so much information without killing a process. We'd better show some helpful information to indicate why this happens. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/619b5b469bcab84ea3bee1d8d04451c781d23feb;" We'd better show some helpful information to indicate why this happens.";yes;yes;no 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;mm, oom: make the calculation of oom badness more accurate;yes;yes;no 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;"Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process";no;yes;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;" Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory";no;no;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;Bellow is part of the oom info, which is enough to analyze this issue;no;no;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;"We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page";no;no;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;" That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1";no;no;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;" Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value";no;no;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;" As a result, the first scanned process will be killed";no;no;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;"The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom";no;no;yes 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;"To fix this issue, we should make the calculation of oom point more accurate";yes;yes;no 32;MDY6Q29tbWl0MjMyNTI5ODo5MDY2ZTVjZmI3M2NkYmNkYmI0OWU4Nzk5OTQ4MmFiNjE1ZTlmYzc2;Yafang Shao;Linus Torvalds;"mm, oom: make the calculation of oom badness more accurate Recently we found an issue on our production environment that when memcg oom is triggered the oom killer doesn't chose the process with largest resident memory but chose the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the oom killer should chose the process with largest resident memory. Bellow is part of the oom info, which is enough to analyze this issue. [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037 [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0 [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0 [...] [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3 [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2 [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2 [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0 [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB We can find that the first scanned process 5740 (pause) was killed, but its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore the negtive point and convert all of these negtive points to 1. Now as oom_score_adj of all the processes in this targeted memcg have the same value -998, the points of these processes are all negtive value. As a result, the first scanned process will be killed. The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a a Guaranteed pod, which has higher priority to prevent from being killed by system oom. To fix this issue, we should make the calculation of oom point more accurate. We can achieve it by convert the chosen_point from 'unsigned long' to 'long'. [cai@lca.pw: reported a issue in the previous version] [mhocko@suse.com: fixed the issue reported by Cai] [mhocko@suse.com: add the comment in proc_oom_score()] [laoar.shao@gmail.com: v3] Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9066e5cfb73cdbcdbb49e87999482ab615e9fc76;" We can achieve it by convert the chosen_point from 'unsigned long' to 'long'.";yes;no;no 34;MDY6Q29tbWl0MjMyNTI5ODpmNTY3OGU3ZjJhYzMxYzI3MDMzNGI5MzYzNTJmMGVmMmZlN2RkMmIz;Christoph Hellwig;Linus Torvalds;"kernel: better document the use_mm/unuse_mm API contract Switch the function documentation to kerneldoc comments, and add WARN_ON_ONCE asserts that the calling thread is a kernel thread and does not have ->mm set (or has ->mm set in the case of unuse_mm). Also give the functions a kthread_ prefix to better document the use case. [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio] Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename] Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb] Acked-by: Haren Myneni <haren@linux.ibm.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f5678e7f2ac31c270334b936352f0ef2fe7dd2b3;kernel: better document the use_mm/unuse_mm API contract;yes;yes;no 34;MDY6Q29tbWl0MjMyNTI5ODpmNTY3OGU3ZjJhYzMxYzI3MDMzNGI5MzYzNTJmMGVmMmZlN2RkMmIz;Christoph Hellwig;Linus Torvalds;"kernel: better document the use_mm/unuse_mm API contract Switch the function documentation to kerneldoc comments, and add WARN_ON_ONCE asserts that the calling thread is a kernel thread and does not have ->mm set (or has ->mm set in the case of unuse_mm). Also give the functions a kthread_ prefix to better document the use case. [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio] Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename] Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb] Acked-by: Haren Myneni <haren@linux.ibm.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f5678e7f2ac31c270334b936352f0ef2fe7dd2b3;"Switch the function documentation to kerneldoc comments, and add WARN_ON_ONCE asserts that the calling thread is a kernel thread and does not have ->mm set (or has ->mm set in the case of unuse_mm)";yes;no;no 34;MDY6Q29tbWl0MjMyNTI5ODpmNTY3OGU3ZjJhYzMxYzI3MDMzNGI5MzYzNTJmMGVmMmZlN2RkMmIz;Christoph Hellwig;Linus Torvalds;"kernel: better document the use_mm/unuse_mm API contract Switch the function documentation to kerneldoc comments, and add WARN_ON_ONCE asserts that the calling thread is a kernel thread and does not have ->mm set (or has ->mm set in the case of unuse_mm). Also give the functions a kthread_ prefix to better document the use case. [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio] Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename] Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb] Acked-by: Haren Myneni <haren@linux.ibm.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f5678e7f2ac31c270334b936352f0ef2fe7dd2b3;Also give the functions a kthread_ prefix to better document the use case.;yes;yes;no 35;MDY6Q29tbWl0MjMyNTI5ODpjMWU4ZDdjNmE3YTY4MmUxNDA1ZTNlMjQyZDMyZmMzNzdmZDE5NmZm;Michel Lespinasse;Linus Torvalds;"mmap locking API: convert mmap_sem comments Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c1e8d7c6a7a682e1405e3e242d32fc377fd196ff;mmap locking API: convert mmap_sem comments;yes;no;no 35;MDY6Q29tbWl0MjMyNTI5ODpjMWU4ZDdjNmE3YTY4MmUxNDA1ZTNlMjQyZDMyZmMzNzdmZDE5NmZm;Michel Lespinasse;Linus Torvalds;"mmap locking API: convert mmap_sem comments Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c1e8d7c6a7a682e1405e3e242d32fc377fd196ff;Convert comments that reference mmap_sem to reference mmap_lock instead.;yes;no;no 36;MDY6Q29tbWl0MjMyNTI5ODozZTRlMjhjNWE4ZjAxZWU0MTc0ZDYzOWUzNmVkMTU1YWRlNDg5YTZm;Michel Lespinasse;Linus Torvalds;"mmap locking API: convert mmap_sem API comments Convert comments that reference old mmap_sem APIs to reference corresponding new mmap locking APIs instead. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3e4e28c5a8f01ee4174d639e36ed155ade489a6f;mmap locking API: convert mmap_sem API comments;yes;no;no 36;MDY6Q29tbWl0MjMyNTI5ODozZTRlMjhjNWE4ZjAxZWU0MTc0ZDYzOWUzNmVkMTU1YWRlNDg5YTZm;Michel Lespinasse;Linus Torvalds;"mmap locking API: convert mmap_sem API comments Convert comments that reference old mmap_sem APIs to reference corresponding new mmap locking APIs instead. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3e4e28c5a8f01ee4174d639e36ed155ade489a6f;"Convert comments that reference old mmap_sem APIs to reference corresponding new mmap locking APIs instead.";yes;yes;no 37;MDY6Q29tbWl0MjMyNTI5ODpkOGVkNDVjNWRjZDQ1NWZjNTg0OGQ0N2Y4Njg4M2ExYjg3MmFjMGQw;Michel Lespinasse;Linus Torvalds;"mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d8ed45c5dcd455fc5848d47f86883a1b872ac0d0;mmap locking API: use coccinelle to convert mmap_sem rwsem call sites;yes;no;no 37;MDY6Q29tbWl0MjMyNTI5ODpkOGVkNDVjNWRjZDQ1NWZjNTg0OGQ0N2Y4Njg4M2ExYjg3MmFjMGQw;Michel Lespinasse;Linus Torvalds;"mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d8ed45c5dcd455fc5848d47f86883a1b872ac0d0;"This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead";yes;yes;no 37;MDY6Q29tbWl0MjMyNTI5ODpkOGVkNDVjNWRjZDQ1NWZjNTg0OGQ0N2Y4Njg4M2ExYjg3MmFjMGQw;Michel Lespinasse;Linus Torvalds;"mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d8ed45c5dcd455fc5848d47f86883a1b872ac0d0;The change is generated using coccinelle with the following rule;yes;no;no 38;MDY6Q29tbWl0MjMyNTI5ODo5N2EyMjVlNjlhMWY4ODA4ODZmMzNkMmU2NWE3YWNlMTNmMTUyY2Fh;Joonsoo Kim;Linus Torvalds;"mm/page_alloc: integrate classzone_idx and high_zoneidx classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97a225e69a1f880886f33d2e65a7ace13f152caa;mm/page_alloc: integrate classzone_idx and high_zoneidx;yes;yes;no 38;MDY6Q29tbWl0MjMyNTI5ODo5N2EyMjVlNjlhMWY4ODA4ODZmMzNkMmU2NWE3YWNlMTNmMTUyY2Fh;Joonsoo Kim;Linus Torvalds;"mm/page_alloc: integrate classzone_idx and high_zoneidx classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97a225e69a1f880886f33d2e65a7ace13f152caa;classzone_idx is just different name for high_zoneidx now;no;yes;yes 38;MDY6Q29tbWl0MjMyNTI5ODo5N2EyMjVlNjlhMWY4ODA4ODZmMzNkMmU2NWE3YWNlMTNmMTUyY2Fh;Joonsoo Kim;Linus Torvalds;"mm/page_alloc: integrate classzone_idx and high_zoneidx classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97a225e69a1f880886f33d2e65a7ace13f152caa;" So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable";yes;yes;no 38;MDY6Q29tbWl0MjMyNTI5ODo5N2EyMjVlNjlhMWY4ODA4ODZmMzNkMmU2NWE3YWNlMTNmMTUyY2Fh;Joonsoo Kim;Linus Torvalds;"mm/page_alloc: integrate classzone_idx and high_zoneidx classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97a225e69a1f880886f33d2e65a7ace13f152caa;"The accessor, ac_classzone_idx() is also removed since it isn't needed after integration";yes;yes;no 38;MDY6Q29tbWl0MjMyNTI5ODo5N2EyMjVlNjlhMWY4ODA4ODZmMzNkMmU2NWE3YWNlMTNmMTUyY2Fh;Joonsoo Kim;Linus Torvalds;"mm/page_alloc: integrate classzone_idx and high_zoneidx classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97a225e69a1f880886f33d2e65a7ace13f152caa;"In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning.";yes;yes;no 39;MDY6Q29tbWl0MjMyNTI5ODo4YTdmZjAyYWNhYmJkODc3NjY5ZmVjYjBhMmU3NWQwOTMwYjYyYzg1;David Rientjes;Linus Torvalds;"mm, oom: dump stack of victim when reaping failed When a process cannot be oom reaped, for whatever reason, currently the list of locks that are held is currently dumped to the kernel log. Much more interesting is the stack trace of the victim that cannot be reaped. If the stack trace is dumped, we have the ability to find related occurrences in the same kernel code and hopefully solve the issue that is making it wedged. Dump the stack trace when a process fails to be oom reaped. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8a7ff02acabbd877669fecb0a2e75d0930b62c85;mm, oom: dump stack of victim when reaping failed;yes;yes;no 39;MDY6Q29tbWl0MjMyNTI5ODo4YTdmZjAyYWNhYmJkODc3NjY5ZmVjYjBhMmU3NWQwOTMwYjYyYzg1;David Rientjes;Linus Torvalds;"mm, oom: dump stack of victim when reaping failed When a process cannot be oom reaped, for whatever reason, currently the list of locks that are held is currently dumped to the kernel log. Much more interesting is the stack trace of the victim that cannot be reaped. If the stack trace is dumped, we have the ability to find related occurrences in the same kernel code and hopefully solve the issue that is making it wedged. Dump the stack trace when a process fails to be oom reaped. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8a7ff02acabbd877669fecb0a2e75d0930b62c85;"When a process cannot be oom reaped, for whatever reason, currently the list of locks that are held is currently dumped to the kernel log";no;no;yes 39;MDY6Q29tbWl0MjMyNTI5ODo4YTdmZjAyYWNhYmJkODc3NjY5ZmVjYjBhMmU3NWQwOTMwYjYyYzg1;David Rientjes;Linus Torvalds;"mm, oom: dump stack of victim when reaping failed When a process cannot be oom reaped, for whatever reason, currently the list of locks that are held is currently dumped to the kernel log. Much more interesting is the stack trace of the victim that cannot be reaped. If the stack trace is dumped, we have the ability to find related occurrences in the same kernel code and hopefully solve the issue that is making it wedged. Dump the stack trace when a process fails to be oom reaped. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8a7ff02acabbd877669fecb0a2e75d0930b62c85;"Much more interesting is the stack trace of the victim that cannot be reaped";no;no;yes 39;MDY6Q29tbWl0MjMyNTI5ODo4YTdmZjAyYWNhYmJkODc3NjY5ZmVjYjBhMmU3NWQwOTMwYjYyYzg1;David Rientjes;Linus Torvalds;"mm, oom: dump stack of victim when reaping failed When a process cannot be oom reaped, for whatever reason, currently the list of locks that are held is currently dumped to the kernel log. Much more interesting is the stack trace of the victim that cannot be reaped. If the stack trace is dumped, we have the ability to find related occurrences in the same kernel code and hopefully solve the issue that is making it wedged. Dump the stack trace when a process fails to be oom reaped. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8a7ff02acabbd877669fecb0a2e75d0930b62c85;" If the stack trace is dumped, we have the ability to find related occurrences in the same kernel code and hopefully solve the issue that is making it wedged";yes;yes;yes 39;MDY6Q29tbWl0MjMyNTI5ODo4YTdmZjAyYWNhYmJkODc3NjY5ZmVjYjBhMmU3NWQwOTMwYjYyYzg1;David Rientjes;Linus Torvalds;"mm, oom: dump stack of victim when reaping failed When a process cannot be oom reaped, for whatever reason, currently the list of locks that are held is currently dumped to the kernel log. Much more interesting is the stack trace of the victim that cannot be reaped. If the stack trace is dumped, we have the ability to find related occurrences in the same kernel code and hopefully solve the issue that is making it wedged. Dump the stack trace when a process fails to be oom reaped. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8a7ff02acabbd877669fecb0a2e75d0930b62c85;Dump the stack trace when a process fails to be oom reaped.;yes;no;no 40;MDY6Q29tbWl0MjMyNTI5ODo5NDFmNzYyYmNiMjc2MjU5YTc4ZTc5MzE2NzQ2Njg4NzRjY2JkYTU5;Ilya Dryomov;Linus Torvalds;"mm/oom: fix pgtables units mismatch in Killed process message pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes. As everything else is printed in kB, I chose to fix the value rather than the string. Before: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name ... [ 1878] 1000 1878 217253 151144 1269760 0 0 python ... Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0 After: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name ... [ 1436] 1000 1436 217253 151890 1294336 0 0 python ... Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0 Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com Fixes: 70cb6d267790 (""mm/oom: add oom_score_adj and pgtables to Killed process message"") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Edward Chron <echron@arista.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/941f762bcb276259a78e7931674668874ccbda59;mm/oom: fix pgtables units mismatch in Killed process message;yes;yes;no 40;MDY6Q29tbWl0MjMyNTI5ODo5NDFmNzYyYmNiMjc2MjU5YTc4ZTc5MzE2NzQ2Njg4NzRjY2JkYTU5;Ilya Dryomov;Linus Torvalds;"mm/oom: fix pgtables units mismatch in Killed process message pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes. As everything else is printed in kB, I chose to fix the value rather than the string. Before: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name ... [ 1878] 1000 1878 217253 151144 1269760 0 0 python ... Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0 After: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name ... [ 1436] 1000 1436 217253 151890 1294336 0 0 python ... Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0 Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com Fixes: 70cb6d267790 (""mm/oom: add oom_score_adj and pgtables to Killed process message"") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Edward Chron <echron@arista.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/941f762bcb276259a78e7931674668874ccbda59;pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes;no;yes;yes 40;MDY6Q29tbWl0MjMyNTI5ODo5NDFmNzYyYmNiMjc2MjU5YTc4ZTc5MzE2NzQ2Njg4NzRjY2JkYTU5;Ilya Dryomov;Linus Torvalds;"mm/oom: fix pgtables units mismatch in Killed process message pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes. As everything else is printed in kB, I chose to fix the value rather than the string. Before: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name ... [ 1878] 1000 1878 217253 151144 1269760 0 0 python ... Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0 After: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name ... [ 1436] 1000 1436 217253 151890 1294336 0 0 python ... Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0 Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com Fixes: 70cb6d267790 (""mm/oom: add oom_score_adj and pgtables to Killed process message"") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Edward Chron <echron@arista.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/941f762bcb276259a78e7931674668874ccbda59;"As everything else is printed in kB, I chose to fix the value rather than the string";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;mm: introduce MADV_COLD;yes;no;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;#NAME?;no;no;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start";yes;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService";no;no;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon)";no;no;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" They are likely to be killed by lmkd if the system has to reclaim memory";no;no;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" In that sense they are similar to entries in any other cache";no;no;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads";no;no;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;#NAME?;no;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"However, they were not significant consumers of swap even though they are good candidate for swap";no;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs";no;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap";no;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;#NAME?;yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Additionally, it could provide many chances for platform to use much information to optimize memory efficiency";no;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;To achieve the goal, the patchset introduce two new options for madvise;yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use";yes;no;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" This could reduce workingset eviction so it ends up increasing performance";no;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;This patch introduces the new MADV_COLD hint to madvise(2) syscall;yes;no;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" The hint can help kernel in deciding which pages to evict early during memory pressure";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;It works for every LRU pages like MADV_[DONTNEED|FREE];yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic";yes;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault";yes;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied";yes;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Even, it could give a bonus to make them be reclaimed on swapless system";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost";yes;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU";yes;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" Let's start simpler way without adding complexity at this moment";yes;yes;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists";yes;yes;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies";yes;no;yes 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;" In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages";yes;no;no 41;MDY6Q29tbWl0MjMyNTI5ODo5YzI3NmNjNjVhNThmYWY5OGJlOGU1Njk2Mjc0NWVjOTlhYjg3NjM2;Minchan Kim;Linus Torvalds;"mm: introduce MADV_COLD Patch series ""Introduce MADV_COLD and MADV_PAGEOUT"", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9c276cc65a58faf98be8e56962745ec99ab87636;"MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages.";yes;no;yes 42;MDY6Q29tbWl0MjMyNTI5ODoxZWI0MWJiMDdlNTY4NjI5ODA1YWNiODQ1NWM0NGI2NzA5ODE5NTUx;Michal Hocko;Linus Torvalds;"mm, oom: consider present pages for the node size constrained_alloc() calculates the size of the oom domain by using node_spanned_pages which is incorrect because this is the full range of the physical memory range that the numa node occupies rather than the memory that backs that range which is represented by node_present_pages. Sparsely populated nodes (e.g. after memory hot remove or simply sparse due to memory layout) can have really a large difference between the two. This shouldn't really cause any real user observable problems because the oom calculates a ratio against totalpages and used memory cannot exceed present pages but it is confusing and wrong from code point of view. Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1eb41bb07e568629805acb8455c44b6709819551;mm, oom: consider present pages for the node size;yes;no;no 42;MDY6Q29tbWl0MjMyNTI5ODoxZWI0MWJiMDdlNTY4NjI5ODA1YWNiODQ1NWM0NGI2NzA5ODE5NTUx;Michal Hocko;Linus Torvalds;"mm, oom: consider present pages for the node size constrained_alloc() calculates the size of the oom domain by using node_spanned_pages which is incorrect because this is the full range of the physical memory range that the numa node occupies rather than the memory that backs that range which is represented by node_present_pages. Sparsely populated nodes (e.g. after memory hot remove or simply sparse due to memory layout) can have really a large difference between the two. This shouldn't really cause any real user observable problems because the oom calculates a ratio against totalpages and used memory cannot exceed present pages but it is confusing and wrong from code point of view. Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1eb41bb07e568629805acb8455c44b6709819551;"constrained_alloc() calculates the size of the oom domain by using node_spanned_pages which is incorrect because this is the full range of the physical memory range that the numa node occupies rather than the memory that backs that range which is represented by node_present_pages";no;yes;yes 42;MDY6Q29tbWl0MjMyNTI5ODoxZWI0MWJiMDdlNTY4NjI5ODA1YWNiODQ1NWM0NGI2NzA5ODE5NTUx;Michal Hocko;Linus Torvalds;"mm, oom: consider present pages for the node size constrained_alloc() calculates the size of the oom domain by using node_spanned_pages which is incorrect because this is the full range of the physical memory range that the numa node occupies rather than the memory that backs that range which is represented by node_present_pages. Sparsely populated nodes (e.g. after memory hot remove or simply sparse due to memory layout) can have really a large difference between the two. This shouldn't really cause any real user observable problems because the oom calculates a ratio against totalpages and used memory cannot exceed present pages but it is confusing and wrong from code point of view. Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1eb41bb07e568629805acb8455c44b6709819551;" after memory hot remove or simply sparse due to memory layout) can have really a large difference between the two";no;no;yes 42;MDY6Q29tbWl0MjMyNTI5ODoxZWI0MWJiMDdlNTY4NjI5ODA1YWNiODQ1NWM0NGI2NzA5ODE5NTUx;Michal Hocko;Linus Torvalds;"mm, oom: consider present pages for the node size constrained_alloc() calculates the size of the oom domain by using node_spanned_pages which is incorrect because this is the full range of the physical memory range that the numa node occupies rather than the memory that backs that range which is represented by node_present_pages. Sparsely populated nodes (e.g. after memory hot remove or simply sparse due to memory layout) can have really a large difference between the two. This shouldn't really cause any real user observable problems because the oom calculates a ratio against totalpages and used memory cannot exceed present pages but it is confusing and wrong from code point of view. Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1eb41bb07e568629805acb8455c44b6709819551;"This shouldn't really cause any real user observable problems because the oom calculates a ratio against totalpages and used memory cannot exceed present pages but it is confusing and wrong from code point of view.";no;yes;yes 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;mm/oom: add oom_score_adj and pgtables to Killed process message;yes;no;no 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;"For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed";yes;yes;no 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;" The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection";yes;no;yes 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;"When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed";yes;yes;yes 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;" Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected";no;yes;no 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;"An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is";no;yes;yes 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;"Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000";no;no;yes 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;" It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process";no;no;yes 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;" The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration";yes;yes;yes 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;" This can be quite helpful for triage and problem determination";no;yes;no 44;MDY6Q29tbWl0MjMyNTI5ODo3MGNiNmQyNjc3OTA1MTIxYmZjN2ZkZjViYWJmZDg0NDQyMThlZGQ5;Edward Chron;Linus Torvalds;"mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70cb6d2677905121bfc7fdf5babfd8444218edd9;"The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process.";yes;yes;no 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;memcg, oom: don't require __GFP_FS when invoking memcg OOM killer;yes;no;no 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;"Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path";no;yes;yes 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;" It turned out that try_charge() is retrying cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom";no;yes;yes 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;"Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation";no;yes;yes 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;" Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g";no;yes;yes 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;" get_user_pages()) will leak -ENOMEM";no;yes;no 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;" Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now";yes;yes;no 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;"Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]";no;no;yes 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;" But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]";no;no;yes 45;MDY6Q29tbWl0MjMyNTI5ODpmOWM2NDU2MjFhMjhlMzc4MTNhMWRlOTZkOWNiZDg5Y2RlOTRhMWU0;Tetsuo Handa;Linus Torvalds;"memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Masoud Sharbiani noticed that commit 29ef680ae7c21110 (""memcg, oom: move out_of_memory back to the charge path"") broke memcg OOM called from __xfs_filemap_fault() path. It turned out that try_charge() is retrying forever without making forward progress because mem_cgroup_oom(GFP_NOFS) cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory""). Allowing forced charge due to being unable to invoke memcg OOM killer will lead to global OOM situation. Also, just returning -ENOMEM will be risky because OOM path is lost and some paths (e.g. get_user_pages()) will leak -ENOMEM. Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be the only choice we can choose for now. Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch. [1] leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x10a/0x2c0 mem_cgroup_out_of_memory+0x46/0x80 mem_cgroup_oom_synchronize+0x2e4/0x310 ? high_work_func+0x20/0x20 pagefault_out_of_memory+0x31/0x76 mm_fault_error+0x55/0x115 ? handle_mm_fault+0xfd/0x220 __do_page_fault+0x433/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 158965 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [2] leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: dump_stack+0x63/0x88 dump_header+0x67/0x27a ? mem_cgroup_scan_tasks+0x91/0xf0 oom_kill_process+0x210/0x410 out_of_memory+0x109/0x2d0 mem_cgroup_out_of_memory+0x46/0x80 try_charge+0x58d/0x650 ? __radix_tree_replace+0x81/0x100 mem_cgroup_try_charge+0x7a/0x100 __add_to_page_cache_locked+0x92/0x180 add_to_page_cache_lru+0x4d/0xf0 iomap_readpages_actor+0xde/0x1b0 ? iomap_zero_range_actor+0x1d0/0x1d0 iomap_apply+0xaf/0x130 iomap_readpages+0x9f/0x150 ? iomap_zero_range_actor+0x1d0/0x1d0 xfs_vm_readpages+0x18/0x20 [xfs] read_pages+0x60/0x140 __do_page_cache_readahead+0x193/0x1b0 ondemand_readahead+0x16d/0x2c0 page_cache_async_readahead+0x9a/0xd0 filemap_fault+0x403/0x620 ? alloc_set_pte+0x12c/0x540 ? _cond_resched+0x14/0x30 __xfs_filemap_fault+0x66/0x180 [xfs] xfs_filemap_fault+0x27/0x30 [xfs] __do_fault+0x19/0x40 __handle_mm_fault+0x8e8/0xb60 handle_mm_fault+0xfd/0x220 __do_page_fault+0x238/0x4e0 do_page_fault+0x22/0x30 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x4009f0 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83 RSP: 002b:00007ffda45c9290 EFLAGS: 00010206 RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000 RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000 RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0 R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 7221 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [3] leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0 leaker cpuset=/ mems_allowed=0 CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 Call Trace: [<ffffffffaf364147>] dump_stack+0x19/0x1b [<ffffffffaf35eb6a>] dump_header+0x90/0x229 [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0 [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60 [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0 [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570 [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0 [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90 [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157 [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0 [<ffffffffaf371925>] do_page_fault+0x35/0x90 [<ffffffffaf36d768>] page_fault+0x28/0x30 Task in /leaker killed as a result of limit of /leaker memory: usage 524288kB, limit 524288kB, failcnt 20628 memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0 kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB Bisected by Masoud Sharbiani. Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp Fixes: 3da88fb3bacfaa33 (""mm, oom: move GFP_NOFS check to out_of_memory"") [necessary after 29ef680ae7c21110] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Masoud Sharbiani <msharbiani@apple.com> Tested-by: Masoud Sharbiani <msharbiani@apple.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9c645621a28e37813a1de96d9cbd89cde94a1e4;" Although in the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get pre-mature memcg OOM reports due to this patch.";no;no;yes 46;MDY6Q29tbWl0MjMyNTI5ODo4YWMzZjhmZTkxYTIxMTk1MjJhNzNmYmM0MWQzNTQwNTcwNTRlNmVk;Joel Savitz;Linus Torvalds;"mm/oom_kill.c: add task UID to info message on an oom kill In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac3f8fe91a2119522a73fbc41d354057054e6ed;mm/oom_kill.c: add task UID to info message on an oom kill;yes;no;no 46;MDY6Q29tbWl0MjMyNTI5ODo4YWMzZjhmZTkxYTIxMTk1MjJhNzNmYmM0MWQzNTQwNTcwNTRlNmVk;Joel Savitz;Linus Torvalds;"mm/oom_kill.c: add task UID to info message on an oom kill In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac3f8fe91a2119522a73fbc41d354057054e6ed;"In the event of an oom kill, useful information about the killed process is printed to dmesg";yes;yes;yes 46;MDY6Q29tbWl0MjMyNTI5ODo4YWMzZjhmZTkxYTIxMTk1MjJhNzNmYmM0MWQzNTQwNTcwNTRlNmVk;Joel Savitz;Linus Torvalds;"mm/oom_kill.c: add task UID to info message on an oom kill In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac3f8fe91a2119522a73fbc41d354057054e6ed;" Users, especially system administrators, will find it useful to immediately see the UID of the process";yes;yes;yes 46;MDY6Q29tbWl0MjMyNTI5ODo4YWMzZjhmZTkxYTIxMTk1MjJhNzNmYmM0MWQzNTQwNTcwNTRlNmVk;Joel Savitz;Linus Torvalds;"mm/oom_kill.c: add task UID to info message on an oom kill In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac3f8fe91a2119522a73fbc41d354057054e6ed;"We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report";no;no;yes 46;MDY6Q29tbWl0MjMyNTI5ODo4YWMzZjhmZTkxYTIxMTk1MjJhNzNmYmM0MWQzNTQwNTcwNTRlNmVk;Joel Savitz;Linus Torvalds;"mm/oom_kill.c: add task UID to info message on an oom kill In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac3f8fe91a2119522a73fbc41d354057054e6ed;" However this information is unavailable when dumping of eligible tasks is disabled";no;no;yes 46;MDY6Q29tbWl0MjMyNTI5ODo4YWMzZjhmZTkxYTIxMTk1MjJhNzNmYmM0MWQzNTQwNTcwNTRlNmVk;Joel Savitz;Linus Torvalds;"mm/oom_kill.c: add task UID to info message on an oom kill In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac3f8fe91a2119522a73fbc41d354057054e6ed;"In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force";no;no;yes 47;MDY6Q29tbWl0MjMyNTI5ODoyYzIwNzk4NWYzNTRkZmI1NDllNWE1NDMxMDJhM2UwODRlZWE4MWY2;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process() Since commit bbbe48029720 (""mm, oom: remove 'prefer children over parent' heuristic"") removed the ""%s: Kill process %d (%s) score %u or sacrifice child\n"" line, oc->chosen_points is no longer used after select_bad_process(). Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2c207985f354dfb549e5a543102a3e084eea81f6;mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process();yes;yes;no 47;MDY6Q29tbWl0MjMyNTI5ODoyYzIwNzk4NWYzNTRkZmI1NDllNWE1NDMxMDJhM2UwODRlZWE4MWY2;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process() Since commit bbbe48029720 (""mm, oom: remove 'prefer children over parent' heuristic"") removed the ""%s: Kill process %d (%s) score %u or sacrifice child\n"" line, oc->chosen_points is no longer used after select_bad_process(). Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2c207985f354dfb549e5a543102a3e084eea81f6;"Since commit bbbe48029720 (""mm, oom: remove 'prefer children over parent' heuristic"") removed the ""%s: Kill process %d (%s) score %u or sacrifice child\n"" line, oc->chosen_points is no longer used after select_bad_process().";no;yes;yes 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;oom: decouple mems_allowed from oom_unkillable_task;yes;yes;no 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;"Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process";no;no;yes 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;"However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from";no;no;yes 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;"Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process";no;yes;yes 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;"Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness";no;yes;yes 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;" However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom";no;no;yes 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;"/proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function";no;yes;yes 48;MDY6Q29tbWl0MjMyNTI5ODphYzMxMWExNGM2ODJkY2Q4YTEyMGE2MjQ0ZDA1NDJlYzY1NGUzZDkz;Shakeel Butt;Linus Torvalds;"oom: decouple mems_allowed from oom_unkillable_task Commit ef08e3b4981a (""[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset"") introduces a heuristic where a potential oom-killer victim is skipped if the intersection of the potential victim and the current (the process triggered the oom) is empty based on the reason that killing such victim most probably will not help the current allocating process. However the commit 7887a3da753e (""[PATCH] oom: cpuset hint"") changed the heuristic to just decrease the oom_badness scores of such potential victim based on the reason that the cpuset of such processes might have changed and previously they may have allocated memory on mems where the current allocating process can allocate from. Unintentionally 7887a3da753e (""[PATCH] oom: cpuset hint"") introduced a side effect as the oom_badness is also exposed to the user space through /proc/[pid]/oom_score, so, readers with different cpusets can read different oom_score of the same process. Later, commit 6cf86ac6f36b (""oom: filter tasks not sharing the same cpuset"") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection back to only oom-killer context and out of oom_badness. However the combination of ab290adbaf8f (""oom: make oom_unkillable_task() helper function"") and 26ebc984913b (""oom: /proc/<pid>/oom_score treat kernel thread honestly"") unintentionally brought back the cpuset intersection check into the oom_badness calculation function. Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and make sure cpuset/mempolicy intersection check is only done in the global oom context. [shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ac311a14c682dcd8a120a6244d0542ec654e3d93;"Other than doing cpuset/mempolicy intersection from oom_badness, the memcg oom context is also doing cpuset/mempolicy intersection which is quite wrong and is caught by syzcaller with the following report";no;yes;yes 49;MDY6Q29tbWl0MjMyNTI5ODo2YmE3NDllZTc4ZWY0MmZmZGY0Yjk1YzA0MmZjNTc0YTM3ZDIyOWQ5;Shakeel Butt;Linus Torvalds;"mm, oom: remove redundant task_in_mem_cgroup() check oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6ba749ee78ef42ffdf4b95c042fc574a37d229d9;mm, oom: remove redundant task_in_mem_cgroup() check;yes;yes;no 49;MDY6Q29tbWl0MjMyNTI5ODo2YmE3NDllZTc4ZWY0MmZmZGY0Yjk1YzA0MmZjNTc0YTM3ZDIyOWQ5;Shakeel Butt;Linus Torvalds;"mm, oom: remove redundant task_in_mem_cgroup() check oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6ba749ee78ef42ffdf4b95c042fc574a37d229d9;oom_unkillable_task() can be called from three different contexts i.e;no;no;yes 49;MDY6Q29tbWl0MjMyNTI5ODo2YmE3NDllZTc4ZWY0MmZmZGY0Yjk1YzA0MmZjNTc0YTM3ZDIyOWQ5;Shakeel Butt;Linus Torvalds;"mm, oom: remove redundant task_in_mem_cgroup() check oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6ba749ee78ef42ffdf4b95c042fc574a37d229d9;global OOM, memcg OOM and oom_score procfs interface;no;no;yes 49;MDY6Q29tbWl0MjMyNTI5ODo2YmE3NDllZTc4ZWY0MmZmZGY0Yjk1YzA0MmZjNTc0YTM3ZDIyOWQ5;Shakeel Butt;Linus Torvalds;"mm, oom: remove redundant task_in_mem_cgroup() check oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6ba749ee78ef42ffdf4b95c042fc574a37d229d9;" At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process";no;no;yes 49;MDY6Q29tbWl0MjMyNTI5ODo2YmE3NDllZTc4ZWY0MmZmZGY0Yjk1YzA0MmZjNTc0YTM3ZDIyOWQ5;Shakeel Butt;Linus Torvalds;"mm, oom: remove redundant task_in_mem_cgroup() check oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6ba749ee78ef42ffdf4b95c042fc574a37d229d9;" Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check";no;no;yes 49;MDY6Q29tbWl0MjMyNTI5ODo2YmE3NDllZTc4ZWY0MmZmZGY0Yjk1YzA0MmZjNTc0YTM3ZDIyOWQ5;Shakeel Butt;Linus Torvalds;"mm, oom: remove redundant task_in_mem_cgroup() check oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6ba749ee78ef42ffdf4b95c042fc574a37d229d9;" However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code";no;yes;yes 49;MDY6Q29tbWl0MjMyNTI5ODo2YmE3NDllZTc4ZWY0MmZmZGY0Yjk1YzA0MmZjNTc0YTM3ZDIyOWQ5;Shakeel Butt;Linus Torvalds;"mm, oom: remove redundant task_in_mem_cgroup() check oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6ba749ee78ef42ffdf4b95c042fc574a37d229d9;" So, just remove the task_in_mem_cgroup() check altogether.";yes;no;no 51;MDY6Q29tbWl0MjMyNTI5ODpmMTY4YTlhNTRlYzM5YjNmODMyYzM1MzczMzg5OGI3MTNiNmI1YzFm;Tetsuo Handa;Linus Torvalds;"mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() Since commit c03cd7738a83 (""cgroup: Include dying leaders with live threads in PROCS iterations"") corrected how CSS_TASK_ITER_PROCS works, mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check only one thread from each thread group. [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()] Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f168a9a54ec39b3f832c353733898b713b6b5c1f;mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks();yes;no;no 51;MDY6Q29tbWl0MjMyNTI5ODpmMTY4YTlhNTRlYzM5YjNmODMyYzM1MzczMzg5OGI3MTNiNmI1YzFm;Tetsuo Handa;Linus Torvalds;"mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() Since commit c03cd7738a83 (""cgroup: Include dying leaders with live threads in PROCS iterations"") corrected how CSS_TASK_ITER_PROCS works, mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check only one thread from each thread group. [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()] Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f168a9a54ec39b3f832c353733898b713b6b5c1f;"Since commit c03cd7738a83 (""cgroup: Include dying leaders with live threads in PROCS iterations"") corrected how CSS_TASK_ITER_PROCS works, mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check only one thread from each thread group.";yes;yes;yes 52;MDY6Q29tbWl0MjMyNTI5ODo0MzJiMWRlMGRlMDJhODNmNjQ2OTVlNjlhMmQ4M2NiZWUxMGMyMzZm;Yafang Shao;Linus Torvalds;"mm/oom_kill.c: fix uninitialized oc->constraint In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before. So the value of it is always the default value 0. We should inititialize it before. Bellow is the output when memcg oom occurs, before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0 after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0 Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com Fixes: ef8444ea01d7 (""mm, oom: reorganize the oom report in dump_header"") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wind Yu <yuzhoujian@didichuxing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/432b1de0de02a83f64695e69a2d83cbee10c236f;mm/oom_kill.c: fix uninitialized oc->constraint;yes;yes;no 52;MDY6Q29tbWl0MjMyNTI5ODo0MzJiMWRlMGRlMDJhODNmNjQ2OTVlNjlhMmQ4M2NiZWUxMGMyMzZm;Yafang Shao;Linus Torvalds;"mm/oom_kill.c: fix uninitialized oc->constraint In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before. So the value of it is always the default value 0. We should inititialize it before. Bellow is the output when memcg oom occurs, before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0 after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0 Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com Fixes: ef8444ea01d7 (""mm, oom: reorganize the oom report in dump_header"") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wind Yu <yuzhoujian@didichuxing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/432b1de0de02a83f64695e69a2d83cbee10c236f;"In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before";no;yes;yes 52;MDY6Q29tbWl0MjMyNTI5ODo0MzJiMWRlMGRlMDJhODNmNjQ2OTVlNjlhMmQ4M2NiZWUxMGMyMzZm;Yafang Shao;Linus Torvalds;"mm/oom_kill.c: fix uninitialized oc->constraint In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before. So the value of it is always the default value 0. We should inititialize it before. Bellow is the output when memcg oom occurs, before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0 after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0 Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com Fixes: ef8444ea01d7 (""mm, oom: reorganize the oom report in dump_header"") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wind Yu <yuzhoujian@didichuxing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/432b1de0de02a83f64695e69a2d83cbee10c236f;" So the value of it is always the default value 0";no;yes;yes 52;MDY6Q29tbWl0MjMyNTI5ODo0MzJiMWRlMGRlMDJhODNmNjQ2OTVlNjlhMmQ4M2NiZWUxMGMyMzZm;Yafang Shao;Linus Torvalds;"mm/oom_kill.c: fix uninitialized oc->constraint In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before. So the value of it is always the default value 0. We should inititialize it before. Bellow is the output when memcg oom occurs, before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0 after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0 Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com Fixes: ef8444ea01d7 (""mm, oom: reorganize the oom report in dump_header"") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wind Yu <yuzhoujian@didichuxing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/432b1de0de02a83f64695e69a2d83cbee10c236f; We should inititialize it before;yes;yes;no 52;MDY6Q29tbWl0MjMyNTI5ODo0MzJiMWRlMGRlMDJhODNmNjQ2OTVlNjlhMmQ4M2NiZWUxMGMyMzZm;Yafang Shao;Linus Torvalds;"mm/oom_kill.c: fix uninitialized oc->constraint In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before. So the value of it is always the default value 0. We should inititialize it before. Bellow is the output when memcg oom occurs, before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0 after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0 Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com Fixes: ef8444ea01d7 (""mm, oom: reorganize the oom report in dump_header"") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wind Yu <yuzhoujian@didichuxing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/432b1de0de02a83f64695e69a2d83cbee10c236f;"Bellow is the output when memcg oom occurs, before this patch";no;no;yes 53;MDY6Q29tbWl0MjMyNTI5ODo0NTdjODk5NjUzOTkxMTVlNWNkOGJmMzhmOWM1OTcyOTM0MDU3MDNk;Thomas Gleixner;Greg Kroah-Hartman;"treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/457c89965399115e5cd8bf38f9c597293405703d;treewide: Add SPDX license identifier for missed files;yes;yes;no 53;MDY6Q29tbWl0MjMyNTI5ODo0NTdjODk5NjUzOTkxMTVlNWNkOGJmMzhmOWM1OTcyOTM0MDU3MDNk;Thomas Gleixner;Greg Kroah-Hartman;"treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/457c89965399115e5cd8bf38f9c597293405703d;Add SPDX license identifiers to all files which;yes;no;no 53;MDY6Q29tbWl0MjMyNTI5ODo0NTdjODk5NjUzOTkxMTVlNWNkOGJmMzhmOWM1OTcyOTM0MDU3MDNk;Thomas Gleixner;Greg Kroah-Hartman;"treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/457c89965399115e5cd8bf38f9c597293405703d;" - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only";yes;yes;yes 53;MDY6Q29tbWl0MjMyNTI5ODo0NTdjODk5NjUzOTkxMTVlNWNkOGJmMzhmOWM1OTcyOTM0MDU3MDNk;Thomas Gleixner;Greg Kroah-Hartman;"treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/457c89965399115e5cd8bf38f9c597293405703d;"The resulting SPDX license identifier is";yes;no;yes 53;MDY6Q29tbWl0MjMyNTI5ODo0NTdjODk5NjUzOTkxMTVlNWNkOGJmMzhmOWM1OTcyOTM0MDU3MDNk;Thomas Gleixner;Greg Kroah-Hartman;"treewide: Add SPDX license identifier for missed files Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/457c89965399115e5cd8bf38f9c597293405703d; GPL-2.0-only;yes;no;no 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;mm/mmu_notifier: contextual information for event triggering invalidation;no;no;no 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;"CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, Users of mmu notifier API track changes to the CPU page table and take specific action for them";no;no;yes 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;" While current API only provide range of virtual address affected by the change, not why the changes is happening";no;yes;yes 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;"This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma)";yes;yes;yes 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;" Passing down the vma allows the users of mmu notifier to inspect the new vma page protection";yes;yes;no 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;"The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens";yes;yes;yes 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;" A latter patch do convert mm call path to use a more appropriate events for each call";no;yes;yes 54;MDY6Q29tbWl0MjMyNTI5ODo2ZjRmMTNlOGQ5ZTI3Y2VmZDJjZDg4ZGQ0ZmQ4MGFhNmQ2OGI5MTMx;Jérôme Glisse;Linus Torvalds;"mm/mmu_notifier: contextual information for event triggering invalidation CPU page table update can happens for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of mmu notifier API track changes to the CPU page table and take specific action for them. While current API only provide range of virtual address affected by the change, not why the changes is happening. This patchset do the initial mechanical convertion of all the places that calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is know (most invalidation happens against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier should assume that every for the range is going away when that event happens. A latter patch do convert mm call path to use a more appropriate events for each call. This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f4f13e8d9e27cefd2cd88dd4fd80aa6d68b9131;"This is done as 2 patches so that no call site is forgotten especialy as it uses this following coccinelle patch";yes;yes;no 55;MDY6Q29tbWl0MjMyNTI5ODpkMzQyYTBiMzg2NzQ4NjdlYTY3ZmRlNDdiMGUxZTYwZmZlOWYxN2Ey;Tetsuo Handa;Linus Torvalds;"mm,oom: don't kill global init via memory.oom.group Since setting global init process to some memory cgroup is technically possible, oom_kill_memcg_member() must check it. Tasks in /test1 are going to be killed due to memory.oom.group set Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int main(int argc, char *argv[]) { static char buffer[10485760]; static int pipe_fd[2] = { EOF, EOF }; unsigned int i; int fd; char buf[64] = { }; if (pipe(pipe_fd)) return 1; if (chdir(""/sys/fs/cgroup/"")) return 1; fd = open(""cgroup.subtree_control"", O_WRONLY); write(fd, ""+memory"", 7); close(fd); mkdir(""test1"", 0755); fd = open(""test1/memory.oom.group"", O_WRONLY); write(fd, ""1"", 1); close(fd); fd = open(""test1/cgroup.procs"", O_WRONLY); write(fd, ""1"", 1); snprintf(buf, sizeof(buf) - 1, ""%d"", getpid()); write(fd, buf, strlen(buf)); close(fd); snprintf(buf, sizeof(buf) - 1, ""%lu"", sizeof(buffer) * 5); fd = open(""test1/memory.max"", O_WRONLY); write(fd, buf, strlen(buf)); close(fd); for (i = 0; i < 10; i++) if (fork() == 0) { char c; close(pipe_fd[1]); read(pipe_fd[0], &c, 1); memset(buffer, 0, sizeof(buffer)); sleep(3); _exit(0); } close(pipe_fd[0]); close(pipe_fd[1]); sleep(3); return 0; } [ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 [ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280 [ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 37.062954][ T9185] Call Trace: [ 37.063976][ T9185] dump_stack+0x67/0x95 [ 37.065263][ T9185] dump_header+0x51/0x570 [ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110 [ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70 [ 37.069967][ T9185] oom_kill_process+0x18d/0x210 [ 37.071515][ T9185] out_of_memory+0x11b/0x380 [ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0 [ 37.074601][ T9185] try_charge+0x790/0x820 [ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0 [ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30 [ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0 [ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070 [ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0 [ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0 [ 37.085181][ T9185] __do_page_fault+0x255/0x4c0 [ 37.086529][ T9185] do_page_fault+0x28/0x260 [ 37.087788][ T9185] ? page_fault+0x8/0x30 [ 37.088978][ T9185] page_fault+0x1e/0x30 [ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0 [ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41 [ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206 [ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000 [ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0 [ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000 [ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005 [ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000 [ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125 [ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB [ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0 [ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB [ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set [ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB [ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB [ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB [ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB [ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB [ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB [ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB [ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB [ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left [ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b [ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280 [ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 37.361685][ T1] Call Trace: [ 37.363239][ T1] dump_stack+0x67/0x95 [ 37.365010][ T1] panic+0xfc/0x2b0 [ 37.366853][ T1] do_exit+0xd55/0xd60 [ 37.368595][ T1] do_group_exit+0x47/0xc0 [ 37.370415][ T1] get_signal+0x32a/0x920 [ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70 [ 37.374596][ T1] do_signal+0x32/0x6e0 [ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b [ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0 [ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b [ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0 [ 37.384594][ T1] ? page_fault+0x8/0x30 [ 37.386453][ T1] retint_user+0x8/0x18 [ 37.388160][ T1] RIP: 0033:0x7f42c06974a8 [ 37.389922][ T1] Code: Bad RIP value. [ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213 [ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000 [ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004 [ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000 [ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001 [ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0 Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp Fixes: 3d8b38eb81cac813 (""mm, oom: introduce memory.oom.group"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d342a0b38674867ea67fde47b0e1e60ffe9f17a2;mm,oom: don't kill global init via memory.oom.group;yes;no;no 55;MDY6Q29tbWl0MjMyNTI5ODpkMzQyYTBiMzg2NzQ4NjdlYTY3ZmRlNDdiMGUxZTYwZmZlOWYxN2Ey;Tetsuo Handa;Linus Torvalds;"mm,oom: don't kill global init via memory.oom.group Since setting global init process to some memory cgroup is technically possible, oom_kill_memcg_member() must check it. Tasks in /test1 are going to be killed due to memory.oom.group set Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int main(int argc, char *argv[]) { static char buffer[10485760]; static int pipe_fd[2] = { EOF, EOF }; unsigned int i; int fd; char buf[64] = { }; if (pipe(pipe_fd)) return 1; if (chdir(""/sys/fs/cgroup/"")) return 1; fd = open(""cgroup.subtree_control"", O_WRONLY); write(fd, ""+memory"", 7); close(fd); mkdir(""test1"", 0755); fd = open(""test1/memory.oom.group"", O_WRONLY); write(fd, ""1"", 1); close(fd); fd = open(""test1/cgroup.procs"", O_WRONLY); write(fd, ""1"", 1); snprintf(buf, sizeof(buf) - 1, ""%d"", getpid()); write(fd, buf, strlen(buf)); close(fd); snprintf(buf, sizeof(buf) - 1, ""%lu"", sizeof(buffer) * 5); fd = open(""test1/memory.max"", O_WRONLY); write(fd, buf, strlen(buf)); close(fd); for (i = 0; i < 10; i++) if (fork() == 0) { char c; close(pipe_fd[1]); read(pipe_fd[0], &c, 1); memset(buffer, 0, sizeof(buffer)); sleep(3); _exit(0); } close(pipe_fd[0]); close(pipe_fd[1]); sleep(3); return 0; } [ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 [ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280 [ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 37.062954][ T9185] Call Trace: [ 37.063976][ T9185] dump_stack+0x67/0x95 [ 37.065263][ T9185] dump_header+0x51/0x570 [ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110 [ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70 [ 37.069967][ T9185] oom_kill_process+0x18d/0x210 [ 37.071515][ T9185] out_of_memory+0x11b/0x380 [ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0 [ 37.074601][ T9185] try_charge+0x790/0x820 [ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0 [ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30 [ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0 [ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070 [ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0 [ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0 [ 37.085181][ T9185] __do_page_fault+0x255/0x4c0 [ 37.086529][ T9185] do_page_fault+0x28/0x260 [ 37.087788][ T9185] ? page_fault+0x8/0x30 [ 37.088978][ T9185] page_fault+0x1e/0x30 [ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0 [ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41 [ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206 [ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000 [ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0 [ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000 [ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005 [ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000 [ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125 [ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB [ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0 [ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB [ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set [ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB [ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB [ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB [ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB [ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB [ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB [ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB [ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB [ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left [ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b [ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280 [ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 37.361685][ T1] Call Trace: [ 37.363239][ T1] dump_stack+0x67/0x95 [ 37.365010][ T1] panic+0xfc/0x2b0 [ 37.366853][ T1] do_exit+0xd55/0xd60 [ 37.368595][ T1] do_group_exit+0x47/0xc0 [ 37.370415][ T1] get_signal+0x32a/0x920 [ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70 [ 37.374596][ T1] do_signal+0x32/0x6e0 [ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b [ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0 [ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b [ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0 [ 37.384594][ T1] ? page_fault+0x8/0x30 [ 37.386453][ T1] retint_user+0x8/0x18 [ 37.388160][ T1] RIP: 0033:0x7f42c06974a8 [ 37.389922][ T1] Code: Bad RIP value. [ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213 [ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000 [ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004 [ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000 [ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001 [ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0 Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp Fixes: 3d8b38eb81cac813 (""mm, oom: introduce memory.oom.group"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d342a0b38674867ea67fde47b0e1e60ffe9f17a2;"Since setting global init process to some memory cgroup is technically possible, oom_kill_memcg_member() must check it";yes;yes;yes 55;MDY6Q29tbWl0MjMyNTI5ODpkMzQyYTBiMzg2NzQ4NjdlYTY3ZmRlNDdiMGUxZTYwZmZlOWYxN2Ey;Tetsuo Handa;Linus Torvalds;"mm,oom: don't kill global init via memory.oom.group Since setting global init process to some memory cgroup is technically possible, oom_kill_memcg_member() must check it. Tasks in /test1 are going to be killed due to memory.oom.group set Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int main(int argc, char *argv[]) { static char buffer[10485760]; static int pipe_fd[2] = { EOF, EOF }; unsigned int i; int fd; char buf[64] = { }; if (pipe(pipe_fd)) return 1; if (chdir(""/sys/fs/cgroup/"")) return 1; fd = open(""cgroup.subtree_control"", O_WRONLY); write(fd, ""+memory"", 7); close(fd); mkdir(""test1"", 0755); fd = open(""test1/memory.oom.group"", O_WRONLY); write(fd, ""1"", 1); close(fd); fd = open(""test1/cgroup.procs"", O_WRONLY); write(fd, ""1"", 1); snprintf(buf, sizeof(buf) - 1, ""%d"", getpid()); write(fd, buf, strlen(buf)); close(fd); snprintf(buf, sizeof(buf) - 1, ""%lu"", sizeof(buffer) * 5); fd = open(""test1/memory.max"", O_WRONLY); write(fd, buf, strlen(buf)); close(fd); for (i = 0; i < 10; i++) if (fork() == 0) { char c; close(pipe_fd[1]); read(pipe_fd[0], &c, 1); memset(buffer, 0, sizeof(buffer)); sleep(3); _exit(0); } close(pipe_fd[0]); close(pipe_fd[1]); sleep(3); return 0; } [ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 [ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280 [ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 37.062954][ T9185] Call Trace: [ 37.063976][ T9185] dump_stack+0x67/0x95 [ 37.065263][ T9185] dump_header+0x51/0x570 [ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110 [ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70 [ 37.069967][ T9185] oom_kill_process+0x18d/0x210 [ 37.071515][ T9185] out_of_memory+0x11b/0x380 [ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0 [ 37.074601][ T9185] try_charge+0x790/0x820 [ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0 [ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30 [ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0 [ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070 [ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0 [ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0 [ 37.085181][ T9185] __do_page_fault+0x255/0x4c0 [ 37.086529][ T9185] do_page_fault+0x28/0x260 [ 37.087788][ T9185] ? page_fault+0x8/0x30 [ 37.088978][ T9185] page_fault+0x1e/0x30 [ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0 [ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41 [ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206 [ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000 [ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0 [ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000 [ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005 [ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000 [ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125 [ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB [ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0 [ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB [ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set [ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB [ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB [ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB [ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB [ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB [ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB [ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB [ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB [ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB [ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left [ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b [ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280 [ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018 [ 37.361685][ T1] Call Trace: [ 37.363239][ T1] dump_stack+0x67/0x95 [ 37.365010][ T1] panic+0xfc/0x2b0 [ 37.366853][ T1] do_exit+0xd55/0xd60 [ 37.368595][ T1] do_group_exit+0x47/0xc0 [ 37.370415][ T1] get_signal+0x32a/0x920 [ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70 [ 37.374596][ T1] do_signal+0x32/0x6e0 [ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b [ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0 [ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b [ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0 [ 37.384594][ T1] ? page_fault+0x8/0x30 [ 37.386453][ T1] retint_user+0x8/0x18 [ 37.388160][ T1] RIP: 0033:0x7f42c06974a8 [ 37.389922][ T1] Code: Bad RIP value. [ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213 [ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000 [ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004 [ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000 [ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001 [ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0 Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp Fixes: 3d8b38eb81cac813 (""mm, oom: introduce memory.oom.group"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d342a0b38674867ea67fde47b0e1e60ffe9f17a2;" Tasks in /test1 are going to be killed due to memory.oom.group set Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b int main(int argc, char *argv[]) for (i = 0; i < 10; i++)";no;no;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;mm, oom: remove 'prefer children over parent' heuristic;yes;no;no 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;"Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent)";no;no;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" Later it was changed to prefer to kill a child who is worst";no;no;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" If the parent is still the worst then the parent will be killed";no;no;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;"This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less";no;no;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" However this is very workload dependent";no;yes;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent";no;yes;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;"The select_bad_process() has already selected the worst process in the system/memcg";no;no;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" There is no need to recheck the badness of its children and hoping to find a worse candidate";no;yes;no 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" That's a lot of unneeded racy work";no;yes;no 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs";no;yes;yes 56;MDY6Q29tbWl0MjMyNTI5ODpiYmJlNDgwMjk3MjBkMmM2YjY3MzNmNzhkMDI1NzFhMjgxNTExYWRi;Shakeel Butt;Linus Torvalds;"mm, oom: remove 'prefer children over parent' heuristic Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbbe48029720d2c6b6733f78d02571a281511adb;" So, let's remove this whole heuristic.";yes;no;no 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;mm, oom: fix use-after-free in oom_kill_process;yes;yes;no 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;"Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process";no;yes;yes 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;" On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process()";no;no;yes 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;" More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk";no;no;yes 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;" The easiest fix is to do get/put across the for_each_thread() on the selected task";yes;yes;no 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;"Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg";yes;no;yes 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;" Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent";no;no;yes 57;MDY6Q29tbWl0MjMyNTI5ODpjZWZjN2VmM2M4N2QwMmZjOTMwNzgzNTg2OGZmNzIxZWExMmNjNTk3;Shakeel Butt;Linus Torvalds;"mm, oom: fix use-after-free in oom_kill_process Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 (""mm, oom: reduce dependency on tasklist_lock"") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cefc7ef3c87d02fc9307835868ff721ea12cc597;" I looked at the history but it seems like this is there before git history.";no;no;no 58;MDY6Q29tbWl0MjMyNTI5ODo5YmNkZWI1MWJkN2QyYWU5ZmU2NWVhNGQ2MDY0M2QyYWVlZjViZmUz;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: do not enqueue same task twice Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task""). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 (""oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3;oom, oom_reaper: do not enqueue same task twice;yes;yes;no 58;MDY6Q29tbWl0MjMyNTI5ODo5YmNkZWI1MWJkN2QyYWU5ZmU2NWVhNGQ2MDY0M2QyYWVlZjViZmUz;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: do not enqueue same task twice Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task""). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 (""oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3;"Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0";no;yes;yes 58;MDY6Q29tbWl0MjMyNTI5ODo5YmNkZWI1MWJkN2QyYWU5ZmU2NWVhNGQ2MDY0M2QyYWVlZjViZmUz;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: do not enqueue same task twice Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task""). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 (""oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3;" It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak";no;yes;yes 58;MDY6Q29tbWl0MjMyNTI5ODo5YmNkZWI1MWJkN2QyYWU5ZmU2NWVhNGQ2MDY0M2QyYWVlZjViZmUz;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: do not enqueue same task twice Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task""). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 (""oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3;"This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper # Processing an OOM victim in a different memcg domain";no;no;yes 58;MDY6Q29tbWl0MjMyNTI5ODo5YmNkZWI1MWJkN2QyYWU5ZmU2NWVhNGQ2MDY0M2QyYWVlZjViZmUz;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: do not enqueue same task twice Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task""). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 (""oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3;" spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request";no;no;yes 58;MDY6Q29tbWl0MjMyNTI5ODo5YmNkZWI1MWJkN2QyYWU5ZmU2NWVhNGQ2MDY0M2QyYWVlZjViZmUz;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: do not enqueue same task twice Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task""). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 (""oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3;"Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"")";yes;yes;no 58;MDY6Q29tbWl0MjMyNTI5ODo5YmNkZWI1MWJkN2QyYWU5ZmU2NWVhNGQ2MDY0M2QyYWVlZjViZmUz;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: do not enqueue same task twice Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 (""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task""). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 (""oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3;" As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path.";yes;yes;no 60;MDY6Q29tbWl0MjMyNTI5ODpmMGM4NjdkOTU4OGQ5ZWZjMTBkNmE1NTAwOWM5NTYwMzM2NjczMzY5;yuzhoujian;Linus Torvalds;"mm, oom: add oom victim's memcg to the oom context information The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0c867d9588d9efc10d6a55009c9560336673369;mm, oom: add oom victim's memcg to the oom context information;yes;no;no 60;MDY6Q29tbWl0MjMyNTI5ODpmMGM4NjdkOTU4OGQ5ZWZjMTBkNmE1NTAwOWM5NTYwMzM2NjczMzY5;yuzhoujian;Linus Torvalds;"mm, oom: add oom victim's memcg to the oom context information The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0c867d9588d9efc10d6a55009c9560336673369;"The current oom report doesn't display victim's memcg context during the global OOM situation";no;no;yes 60;MDY6Q29tbWl0MjMyNTI5ODpmMGM4NjdkOTU4OGQ5ZWZjMTBkNmE1NTAwOWM5NTYwMzM2NjczMzY5;yuzhoujian;Linus Torvalds;"mm, oom: add oom victim's memcg to the oom context information The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0c867d9588d9efc10d6a55009c9560336673369;" While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process";no;yes;yes 60;MDY6Q29tbWl0MjMyNTI5ODpmMGM4NjdkOTU4OGQ5ZWZjMTBkNmE1NTAwOWM5NTYwMzM2NjczMzY5;yuzhoujian;Linus Torvalds;"mm, oom: add oom victim's memcg to the oom context information The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0c867d9588d9efc10d6a55009c9560336673369;" Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg";yes;yes;yes 60;MDY6Q29tbWl0MjMyNTI5ODpmMGM4NjdkOTU4OGQ5ZWZjMTBkNmE1NTAwOWM5NTYwMzM2NjczMzY5;yuzhoujian;Linus Torvalds;"mm, oom: add oom victim's memcg to the oom context information The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0c867d9588d9efc10d6a55009c9560336673369;Below is the single line output in the oom report after this patch;no;no;no 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;mm, oom: reorganize the oom report in dump_header;yes;yes;no 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;OOM report contains several sections;no;no;yes 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;" The first one is the allocation context that has triggered the OOM";no;no;yes 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;" Then we have cpuset context followed by the stack trace of the OOM path";no;no;yes 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;" The tird one is the OOM memory information";no;no;yes 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd; Followed by the current memory state of all system tasks;no;no;yes 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;"At last, we will show oom eligible tasks and the information about the chosen oom victim";no;no;yes 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;"One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context";no;yes;yes 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;" This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request 2) OOM stack trace 3) OOM memory information active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 4) current memory state of all system tasks 5) oom context (contrains and the chosen victim)";yes;no;no 61;MDY6Q29tbWl0MjMyNTI5ODplZjg0NDRlYTAxZDc0NDI2NTJmOGUxYjhhOGI5NDI3OGNiNTdlYWZk;yuzhoujian;Linus Torvalds;"mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: ""Kirill A . Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef8444ea01d7442652f8e1b8a8b94278cb57eafd;"oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier.";yes;yes;no 64;MDY6Q29tbWl0MjMyNTI5ODo3OWNjODEwNTdlZWY3YWQ4NDY1ODg5NzYyOTZhYjBmMjY2YzFhN2E1;Tetsuo Handa;Linus Torvalds;"mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). Commit 93065ac753e4 (""mm, oom: distinguish blockable mode for mmu notifiers"") has added an ability to skip over vmas with blockable mmu notifiers. This however didn't call tlb_finish_mmu as it should. As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed. This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/79cc81057eef7ad846588976296ab0f266c1a7a5;mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm().;yes;yes;no 64;MDY6Q29tbWl0MjMyNTI5ODo3OWNjODEwNTdlZWY3YWQ4NDY1ODg5NzYyOTZhYjBmMjY2YzFhN2E1;Tetsuo Handa;Linus Torvalds;"mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). Commit 93065ac753e4 (""mm, oom: distinguish blockable mode for mmu notifiers"") has added an ability to skip over vmas with blockable mmu notifiers. This however didn't call tlb_finish_mmu as it should. As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed. This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/79cc81057eef7ad846588976296ab0f266c1a7a5;"Commit 93065ac753e4 (""mm, oom: distinguish blockable mode for mmu notifiers"") has added an ability to skip over vmas with blockable mmu notifiers";no;no;yes 64;MDY6Q29tbWl0MjMyNTI5ODo3OWNjODEwNTdlZWY3YWQ4NDY1ODg5NzYyOTZhYjBmMjY2YzFhN2E1;Tetsuo Handa;Linus Torvalds;"mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). Commit 93065ac753e4 (""mm, oom: distinguish blockable mode for mmu notifiers"") has added an ability to skip over vmas with blockable mmu notifiers. This however didn't call tlb_finish_mmu as it should. As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed. This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/79cc81057eef7ad846588976296ab0f266c1a7a5;This however didn't call tlb_finish_mmu as it should;no;yes;yes 64;MDY6Q29tbWl0MjMyNTI5ODo3OWNjODEwNTdlZWY3YWQ4NDY1ODg5NzYyOTZhYjBmMjY2YzFhN2E1;Tetsuo Handa;Linus Torvalds;"mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). Commit 93065ac753e4 (""mm, oom: distinguish blockable mode for mmu notifiers"") has added an ability to skip over vmas with blockable mmu notifiers. This however didn't call tlb_finish_mmu as it should. As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed. This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/79cc81057eef7ad846588976296ab0f266c1a7a5;"As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed";no;yes;yes 64;MDY6Q29tbWl0MjMyNTI5ODo3OWNjODEwNTdlZWY3YWQ4NDY1ODg5NzYyOTZhYjBmMjY2YzFhN2E1;Tetsuo Handa;Linus Torvalds;"mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). Commit 93065ac753e4 (""mm, oom: distinguish blockable mode for mmu notifiers"") has added an ability to skip over vmas with blockable mmu notifiers. This however didn't call tlb_finish_mmu as it should. As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed. This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/79cc81057eef7ad846588976296ab0f266c1a7a5;" This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it.";yes;yes;yes 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;mm, oom: refactor oom_kill_process();yes;yes;no 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;"Patch series ""introduce memory.oom.group"", v2";no;no;yes 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;"This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload";yes;yes;no 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;"Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward";yes;yes;yes 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;" So, hopefully, we can avoid having long debates here, as we had with the full implementation";no;no;yes 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;"As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward";yes;yes;no 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;"oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information";yes;no;yes 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;" The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim";yes;no;yes 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;"This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process()";yes;yes;no 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;"The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup";yes;no;no 67;MDY6Q29tbWl0MjMyNTI5ODo1OTg5YWQ3YjVlZGUzOGQ2MDVjNTg4OTgxZjYzNGMwODI1MmFiZmMz;Roman Gushchin;Linus Torvalds;"mm, oom: refactor oom_kill_process() Patch series ""introduce memory.oom.group"", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5989ad7b5ede38d605c588981f634c08252abfc3;" We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process().";yes;yes;no 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;mm/oom_kill.c: clean up oom_reap_task_mm();yes;yes;no 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;Andrew has noticed some inconsistencies in oom_reap_task_mm;no;yes;yes 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;" Notably - Undocumented return value";no;yes;yes 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;" - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future";no;yes;yes 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;" - fails to call trace_finish_task_reaping() in one case - code duplication";no;yes;yes 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;" - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region";yes;yes;yes 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;" So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing";no;yes;yes 68;MDY6Q29tbWl0MjMyNTI5ODo0MzFmNDJmZGZkYjM2ZjA2ZjQzYzcxMWZjNTliZTliODE0ZDhmYjIy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: clean up oom_reap_task_mm() Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment ""failed to reap part..."" is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/431f42fdfdb36f06f43c711fc59be9b814d8fb22;"Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths.";yes;yes;no 69;MDY6Q29tbWl0MjMyNTI5ODpjM2I3OGIxMWVmYmIyODY1NDMzYWJmOWQyMmMwMDRmZmU0YTczZjVj;Rodrigo Freire;Linus Torvalds;"mm, oom: describe task memory unit, larger PID pad The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c3b78b11efbb2865433abf9d22c004ffe4a73f5c;mm, oom: describe task memory unit, larger PID pad;yes;yes;no 69;MDY6Q29tbWl0MjMyNTI5ODpjM2I3OGIxMWVmYmIyODY1NDMzYWJmOWQyMmMwMDRmZmU0YTczZjVj;Rodrigo Freire;Linus Torvalds;"mm, oom: describe task memory unit, larger PID pad The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c3b78b11efbb2865433abf9d22c004ffe4a73f5c;"The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs";no;yes;yes 69;MDY6Q29tbWl0MjMyNTI5ODpjM2I3OGIxMWVmYmIyODY1NDMzYWJmOWQyMmMwMDRmZmU0YTczZjVj;Rodrigo Freire;Linus Torvalds;"mm, oom: describe task memory unit, larger PID pad The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c3b78b11efbb2865433abf9d22c004ffe4a73f5c;" Add a small printk prior to the task dump informing that the memory units are actually memory _pages_";yes;yes;no 69;MDY6Q29tbWl0MjMyNTI5ODpjM2I3OGIxMWVmYmIyODY1NDMzYWJmOWQyMmMwMDRmZmU0YTczZjVj;Rodrigo Freire;Linus Torvalds;"mm, oom: describe task memory unit, larger PID pad The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c3b78b11efbb2865433abf9d22c004ffe4a73f5c;Also extends PID field to align on up to 7 characters;yes;yes;no 69;MDY6Q29tbWl0MjMyNTI5ODpjM2I3OGIxMWVmYmIyODY1NDMzYWJmOWQyMmMwMDRmZmU0YTczZjVj;Rodrigo Freire;Linus Torvalds;"mm, oom: describe task memory unit, larger PID pad The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c3b78b11efbb2865433abf9d22c004ffe4a73f5c;Reference ;no;no;no 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;mm, oom: remove oom_lock from oom_reaper;yes;no;no 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;"oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper";no;no;yes 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;"close race with exiting task"")";no;no;yes 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;" We do not really need the lock anymore though";no;yes;no 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;" 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore";no;yes;yes 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;"Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim";no;no;yes 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;" Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again";yes;yes;yes 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;"Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm)";yes;no;no 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;" The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag";yes;no;yes 70;MDY6Q29tbWl0MjMyNTI5ODphZjU2NzlmYmM2NjlmMzFmN2ViZDBkNDczYmNhNzZjMjRjMDdkZTMw;Michal Hocko;Linus Torvalds;"mm, oom: remove oom_lock from oom_reaper oom_reaper used to rely on the oom_lock since e2fe14564d33 (""oom_reaper: close race with exiting task""). We do not really need the lock anymore though. 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5679fbc669f31f7ebd0d473bca76c24c07de30;" There is no synchronization with out_of_memory path now.";yes;no;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;mm, oom: distinguish blockable mode for mmu notifiers;yes;yes;no 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;"There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks";no;yes;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;"Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep";no;no;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet";no;yes;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;We can do much better though;no;yes;no 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held";no;yes;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range";no;yes;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though";no;yes;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;This patch handles the low hanging fruit;yes;yes;no 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;"__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false";yes;no;no 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain";yes;no;no 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;"I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that";yes;yes;no 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" The first part can be done without a sleeping lock in most cases AFAICS";yes;yes;no 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;"The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode";yes;no;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing";yes;no;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;"The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom";no;no;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9; This can be done e.g;no;no;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9;" after the test faults in all the mmu notifier managed memory and set the hard limit to something really small";no;no;yes 71;MDY6Q29tbWl0MjMyNTI5ODo5MzA2NWFjNzUzZTQ0NDM4NDBhMDU3YmZlZjRiZTcxZWM3NjZmZGU5;Michal Hocko;Linus Torvalds;"mm, oom: distinguish blockable mode for mmu notifiers There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: ""David (ChunMing) Zhou"" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: ""Jérôme Glisse"" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93065ac753e4443840a057bfef4be71ec766fde9; Then we are looking for a proper process tear down.;no;no;yes 72;MDY6Q29tbWl0MjMyNTI5ODphMTk1ZDNmNWI3NGYzZjQ1YTY3NDJmOTA2M2I1ZTk1YTI1MjJiNDZk;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: document oom_lock Add comments describing oom_lock's scope. Requested-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a195d3f5b74f3f45a6742f9063b5e95a2522b46d;mm/oom_kill.c: document oom_lock;yes;yes;no 72;MDY6Q29tbWl0MjMyNTI5ODphMTk1ZDNmNWI3NGYzZjQ1YTY3NDJmOTA2M2I1ZTk1YTI1MjJiNDZk;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: document oom_lock Add comments describing oom_lock's scope. Requested-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a195d3f5b74f3f45a6742f9063b5e95a2522b46d;Add comments describing oom_lock's scope.;yes;yes;no 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;mm, oom: remove sleep from under oom_lock;yes;no;no 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;"Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock";no;no;yes 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;" Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose";no;yes;yes 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;"Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful";no;yes;yes 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;" Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well";no;yes;yes 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;Drop the short sleep from out_of_memory when we hold the lock;yes;no;no 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;" Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit";yes;no;no 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;This should be solved in a more reasonable way (e.g;yes;yes;no 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146;" sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve";yes;yes;no 73;MDY6Q29tbWl0MjMyNTI5ODo5YmZlNWRlZDA1NGI4ZTI4YTk0Yzc4NTgwZjIzM2Q2ODc5YTAwMTQ2;Michal Hocko;Linus Torvalds;"mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since 27ae357fa82b (""mm, oom: fix concurrent munlock and oom reaper unmap, v3"") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9bfe5ded054b8e28a94c78580f233d6879a00146; This is a quick fixup to remove a stale code.;yes;yes;no 76;MDY6Q29tbWl0MjMyNTI5ODpiYmVjMmUxNTE3MGFhZTNlMDg0ZDdkOWFmYzczMGFlZWJlMDFiNjU0;Roman Gushchin;Linus Torvalds;"mm: rename page_counter's count/limit into usage/max This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbec2e15170aae3e084d7d9afc730aeebe01b654;mm: rename page_counter's count/limit into usage/max;yes;no;no 76;MDY6Q29tbWl0MjMyNTI5ODpiYmVjMmUxNTE3MGFhZTNlMDg0ZDdkOWFmYzczMGFlZWJlMDFiNjU0;Roman Gushchin;Linus Torvalds;"mm: rename page_counter's count/limit into usage/max This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbec2e15170aae3e084d7d9afc730aeebe01b654;This patch renames struct page_counter fields;yes;no;no 76;MDY6Q29tbWl0MjMyNTI5ODpiYmVjMmUxNTE3MGFhZTNlMDg0ZDdkOWFmYzczMGFlZWJlMDFiNjU0;Roman Gushchin;Linus Torvalds;"mm: rename page_counter's count/limit into usage/max This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbec2e15170aae3e084d7d9afc730aeebe01b654;" count -> usage limit -> max and the corresponding functions";yes;no;no 76;MDY6Q29tbWl0MjMyNTI5ODpiYmVjMmUxNTE3MGFhZTNlMDg0ZDdkOWFmYzczMGFlZWJlMDFiNjU0;Roman Gushchin;Linus Torvalds;"mm: rename page_counter's count/limit into usage/max This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbec2e15170aae3e084d7d9afc730aeebe01b654;" page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API";yes;yes;no 76;MDY6Q29tbWl0MjMyNTI5ODpiYmVjMmUxNTE3MGFhZTNlMDg0ZDdkOWFmYzczMGFlZWJlMDFiNjU0;Roman Gushchin;Linus Torvalds;"mm: rename page_counter's count/limit into usage/max This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbec2e15170aae3e084d7d9afc730aeebe01b654;This is pure renaming, this patch doesn't bring any functional change.;yes;yes;no 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;mm, oom: fix concurrent munlock and oom reaper unmap, v3;yes;yes;no 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;"Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set";no;no;yes 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;"This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma";no;no;yes 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;" Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy";no;yes;yes 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;"This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup()";no;no;yes 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;" If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops";no;yes;yes 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;"Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP";yes;yes;no 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;" The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap()";yes;yes;yes 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;" It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing";yes;yes;yes 77;MDY6Q29tbWl0MjMyNTI5ODoyN2FlMzU3ZmE4MmJlNWFiNzNiMmVmOGQzOWRjYjhjYTI1NjM0ODNh;David Rientjes;Linus Torvalds;"mm, oom: fix concurrent munlock and oom reaper unmap, v3 Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/27ae357fa82be5ab73b2ef8d39dcb8ca2563483a;This issue fixes CVE-2018-1000200.;no;yes;no 79;MDY6Q29tbWl0MjMyNTI5ODpkNDYwNzhiMjg4ODk0N2E4NmI2YmI5OTdjZDU5MjdlNjAyZThmZGM5;David Rientjes;Linus Torvalds;"mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d46078b2888947a86b6bb997cd5927e602e8fdc9;mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes;yes;no;no 79;MDY6Q29tbWl0MjMyNTI5ODpkNDYwNzhiMjg4ODk0N2E4NmI2YmI5OTdjZDU5MjdlNjAyZThmZGM5;David Rientjes;Linus Torvalds;"mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d46078b2888947a86b6bb997cd5927e602e8fdc9;"Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes";no;no;yes 79;MDY6Q29tbWl0MjMyNTI5ODpkNDYwNzhiMjg4ODk0N2E4NmI2YmI5OTdjZDU5MjdlNjAyZThmZGM5;David Rientjes;Linus Torvalds;"mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d46078b2888947a86b6bb997cd5927e602e8fdc9;"This has always been implicit and nothing exactly relies on the behavior";no;no;yes 79;MDY6Q29tbWl0MjMyNTI5ODpkNDYwNzhiMjg4ODk0N2E4NmI2YmI5OTdjZDU5MjdlNjAyZThmZGM5;David Rientjes;Linus Torvalds;"mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d46078b2888947a86b6bb997cd5927e602e8fdc9;"Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held";no;yes;yes 79;MDY6Q29tbWl0MjMyNTI5ODpkNDYwNzhiMjg4ODk0N2E4NmI2YmI5OTdjZDU5MjdlNjAyZThmZGM5;David Rientjes;Linus Torvalds;"mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d46078b2888947a86b6bb997cd5927e602e8fdc9;Remove the CAP_SYS_ADMIN bias so that all processes are treated equally;yes;yes;no 79;MDY6Q29tbWl0MjMyNTI5ODpkNDYwNzhiMjg4ODk0N2E4NmI2YmI5OTdjZDU5MjdlNjAyZThmZGM5;David Rientjes;Linus Torvalds;"mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d46078b2888947a86b6bb997cd5927e602e8fdc9;"If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj.";yes;yes;yes 80;MDY6Q29tbWl0MjMyNTI5ODplOGIwOThmYzU3NDdhN2M4NzFmMTEzYzllYjY1NDUzY2MyZDg2ZTZm;Mike Rapoport;Linus Torvalds;"mm: kernel-doc: add missing parameter descriptions Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e8b098fc5747a7c871f113c9eb65453cc2d86e6f;mm: kernel-doc: add missing parameter descriptions;yes;yes;no 81;MDY6Q29tbWl0MjMyNTI5ODpmMzQwZmY4MjAzNDViMTc5YjY5N2Y2NmVjNjc0M2M3MDQxNmJmOTNm;David Rientjes;Linus Torvalds;"mm, oom: avoid reaping only for mm's with blockable invalidate callbacks This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range(). Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141330120.74052@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dimitri Sivanich <sivanich@hpe.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Joerg Roedel <joro@8bytes.org> Cc: Doug Ledford <dledford@redhat.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Radim KrÄÂmář <rkrcmar@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f340ff820345b179b697f66ec6743c70416bf93f;mm, oom: avoid reaping only for mm's with blockable invalidate callbacks;yes;no;no 81;MDY6Q29tbWl0MjMyNTI5ODpmMzQwZmY4MjAzNDViMTc5YjY5N2Y2NmVjNjc0M2M3MDQxNmJmOTNm;David Rientjes;Linus Torvalds;"mm, oom: avoid reaping only for mm's with blockable invalidate callbacks This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range(). Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141330120.74052@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dimitri Sivanich <sivanich@hpe.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Joerg Roedel <joro@8bytes.org> Cc: Doug Ledford <dledford@redhat.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Radim KrÄÂmář <rkrcmar@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f340ff820345b179b697f66ec6743c70416bf93f;"This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping";yes;yes;yes 81;MDY6Q29tbWl0MjMyNTI5ODpmMzQwZmY4MjAzNDViMTc5YjY5N2Y2NmVjNjc0M2M3MDQxNmJmOTNm;David Rientjes;Linus Torvalds;"mm, oom: avoid reaping only for mm's with blockable invalidate callbacks This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range(). Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141330120.74052@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dimitri Sivanich <sivanich@hpe.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Joerg Roedel <joro@8bytes.org> Cc: Doug Ledford <dledford@redhat.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Radim KrÄÂmář <rkrcmar@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f340ff820345b179b697f66ec6743c70416bf93f;Otherwise, the callbacks are used around unmap_page_range().;yes;no;no 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;mm, oom_reaper: fix memory corruption;yes;yes;no 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;"David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient";no;yes;yes 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;" We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path";no;no;yes 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;" This can trivially happen from context of any task which has grabbed mm reference (e.g";no;no;yes 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;" to read /proc/<pid>/ file which requires mm etc.)";no;no;yes 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;"The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task";yes;yes;yes 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;" Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim";yes;yes;no 82;MDY6Q29tbWl0MjMyNTI5ODo0ODM3ZmUzN2FkZmYxZDE1OTkwNGYwYzAxMzQ3MWIxZWNiY2I0NTVl;Michal Hocko;Linus Torvalds;"mm, oom_reaper: fix memory corruption David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 (""mm: oom: let oom_reap_task and exit_mmap run concurrently"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4837fe37adff1d159904f0c013471b1ecbcb455e;" The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path";yes;yes;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;mm, oom_reaper: gather each vma to prevent leaking TLB entry;yes;yes;no 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;"tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space";no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146; In this case, tlb->fullmm is true;no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" Some archs like arm64 doesn't flush TLB when tlb->fullmm is true";no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1"")";no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;Which causes leaking of tlb entries;no;yes;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;Will clarifies his patch;no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB";yes;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;"This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB)";no;yes;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores";no;yes;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;"In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages";no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" However, due to the above problem, the TLB entries are not really flushed on arm64";no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" Other threads are possible to access these pages through TLB entries";no;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs";no;yes;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;This patch gathers each vma instead of gathering full vm space;yes;no;no 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" In this case tlb->fullmm is not true";yes;no;yes 83;MDY6Q29tbWl0MjMyNTI5ODo2ODdjYjA4ODRhNzE0ZmY0ODRkMDM4ZTkxOTBlZGM4NzRlZGNmMTQ2;Wang Nan;Linus Torvalds;"mm, oom_reaper: gather each vma to prevent leaking TLB entry tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 (""arm64: tlbflush: avoid flushing when fullmm == 1""). Which causes leaking of tlb entries. Will clarifies his patch: ""Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB"" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/687cb0884a714ff484d038e9190edc874edcf146;" The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs.";yes;yes;no 84;MDY6Q29tbWl0MjMyNTI5ODowMjA1Zjc1NTcxZTNhNzBjMzVmMGRkNWU2MDg3NzNjY2U5N2Q5ZGJi;Michal Hocko;Linus Torvalds;"mm: simplify nodemask printing alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0205f75571e3a70c35f0dd5e608773cce97d9dbb;mm: simplify nodemask printing;yes;yes;no 84;MDY6Q29tbWl0MjMyNTI5ODowMjA1Zjc1NTcxZTNhNzBjMzVmMGRkNWU2MDg3NzNjY2U5N2Q5ZGJi;Michal Hocko;Linus Torvalds;"mm: simplify nodemask printing alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0205f75571e3a70c35f0dd5e608773cce97d9dbb;"alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont";no;no;yes 84;MDY6Q29tbWl0MjMyNTI5ODowMjA1Zjc1NTcxZTNhNzBjMzVmMGRkNWU2MDg3NzNjY2U5N2Q5ZGJi;Michal Hocko;Linus Torvalds;"mm: simplify nodemask printing alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0205f75571e3a70c35f0dd5e608773cce97d9dbb; We can do better;no;yes;no 84;MDY6Q29tbWl0MjMyNTI5ODowMjA1Zjc1NTcxZTNhNzBjMzVmMGRkNWU2MDg3NzNjY2U5N2Q5ZGJi;Michal Hocko;Linus Torvalds;"mm: simplify nodemask printing alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0205f75571e3a70c35f0dd5e608773cce97d9dbb;" printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully";yes;yes;yes 84;MDY6Q29tbWl0MjMyNTI5ODowMjA1Zjc1NTcxZTNhNzBjMzVmMGRkNWU2MDg3NzNjY2U5N2Q5ZGJi;Michal Hocko;Linus Torvalds;"mm: simplify nodemask printing alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0205f75571e3a70c35f0dd5e608773cce97d9dbb;" This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether";yes;yes;no 84;MDY6Q29tbWl0MjMyNTI5ODowMjA1Zjc1NTcxZTNhNzBjMzVmMGRkNWU2MDg3NzNjY2U5N2Q5ZGJi;Michal Hocko;Linus Torvalds;"mm: simplify nodemask printing alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0205f75571e3a70c35f0dd5e608773cce97d9dbb;This patch has been motivated by patch from Joe Perches;no;no;yes 85;MDY6Q29tbWl0MjMyNTI5ODpjNTA4NDJjOGUxY2RkY2RiNjlkM2VjZTRmNGRmMDA1YTBlNmM1Y2Vi;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: remove pointless kthread_run() error check Since oom_init() is called before userspace processes start, memory allocation failure for creating the OOM reaper kernel thread will let the OOM killer call panic() rather than wake up the OOM reaper. Link: http://lkml.kernel.org/r/1510137800-4602-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c50842c8e1cddcdb69d3ece4f4df005a0e6c5ceb;mm,oom_reaper: remove pointless kthread_run() error check;yes;yes;no 85;MDY6Q29tbWl0MjMyNTI5ODpjNTA4NDJjOGUxY2RkY2RiNjlkM2VjZTRmNGRmMDA1YTBlNmM1Y2Vi;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: remove pointless kthread_run() error check Since oom_init() is called before userspace processes start, memory allocation failure for creating the OOM reaper kernel thread will let the OOM killer call panic() rather than wake up the OOM reaper. Link: http://lkml.kernel.org/r/1510137800-4602-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c50842c8e1cddcdb69d3ece4f4df005a0e6c5ceb;"Since oom_init() is called before userspace processes start, memory allocation failure for creating the OOM reaper kernel thread will let the OOM killer call panic() rather than wake up the OOM reaper.";yes;yes;yes 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;mm: consolidate page table accounting;yes;yes;no 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;"Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation";no;yes;yes 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;" We also provide the information to userspace, but it has dubious value there too";no;yes;yes 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;This patch switches page table accounting to single counter;yes;no;no 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;mm->pgtables_bytes is now used to account all page table levels;yes;no;no 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;" We use bytes, because page table size for different levels of page table tree may be different";yes;yes;no 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;"The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status";yes;no;yes 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686; Not sure if anybody uses them;no;no;yes 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;" (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds";yes;no;no 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;"Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size";yes;yes;no 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;"The only downside can be debuggability because we do not know which page table level could leak";yes;no;no 86;MDY6Q29tbWl0MjMyNTI5ODphZjViMGY2YTA5ZTQyYzlmNGZhODc3MzVmMmEzNjY3NDg3NjdiNjg2;Kirill A. Shutemov;Linus Torvalds;"mm: consolidate page table accounting Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b0f6a09e42c9f4fa87735f2a366748767b686;" But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this.";yes;yes;yes 87;MDY6Q29tbWl0MjMyNTI5ODpjNDgxMjkwOWY1ZDVhOWI3ZjFjODVhMmQ5NWJlMzg4YTA2NmNkYTUy;Kirill A. Shutemov;Linus Torvalds;"mm: introduce wrappers to access mm->nr_ptes Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page table accounting doesn't make sense if you don't have page tables. It's preparation for consolidation of page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c4812909f5d5a9b7f1c85a2d95be388a066cda52;mm: introduce wrappers to access mm->nr_ptes;yes;yes;no 87;MDY6Q29tbWl0MjMyNTI5ODpjNDgxMjkwOWY1ZDVhOWI3ZjFjODVhMmQ5NWJlMzg4YTA2NmNkYTUy;Kirill A. Shutemov;Linus Torvalds;"mm: introduce wrappers to access mm->nr_ptes Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page table accounting doesn't make sense if you don't have page tables. It's preparation for consolidation of page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c4812909f5d5a9b7f1c85a2d95be388a066cda52;"Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud";yes;yes;no 87;MDY6Q29tbWl0MjMyNTI5ODpjNDgxMjkwOWY1ZDVhOWI3ZjFjODVhMmQ5NWJlMzg4YTA2NmNkYTUy;Kirill A. Shutemov;Linus Torvalds;"mm: introduce wrappers to access mm->nr_ptes Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page table accounting doesn't make sense if you don't have page tables. It's preparation for consolidation of page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c4812909f5d5a9b7f1c85a2d95be388a066cda52;The patch also makes nr_ptes accounting dependent onto CONFIG_MMU;yes;no;no 87;MDY6Q29tbWl0MjMyNTI5ODpjNDgxMjkwOWY1ZDVhOWI3ZjFjODVhMmQ5NWJlMzg4YTA2NmNkYTUy;Kirill A. Shutemov;Linus Torvalds;"mm: introduce wrappers to access mm->nr_ptes Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page table accounting doesn't make sense if you don't have page tables. It's preparation for consolidation of page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c4812909f5d5a9b7f1c85a2d95be388a066cda52;" Page table accounting doesn't make sense if you don't have page tables";no;yes;yes 87;MDY6Q29tbWl0MjMyNTI5ODpjNDgxMjkwOWY1ZDVhOWI3ZjFjODVhMmQ5NWJlMzg4YTA2NmNkYTUy;Kirill A. Shutemov;Linus Torvalds;"mm: introduce wrappers to access mm->nr_ptes Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page table accounting doesn't make sense if you don't have page tables. It's preparation for consolidation of page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c4812909f5d5a9b7f1c85a2d95be388a066cda52;It's preparation for consolidation of page-table counters in mm_struct.;no;yes;no 88;MDY6Q29tbWl0MjMyNTI5ODpiNGU5OGQ5YWM3NzU5MDdjYzUzZmIwOGZjYjY3NzZkZWI3Njk0ZTMw;Kirill A. Shutemov;Linus Torvalds;"mm: account pud page tables On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process""). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4e98d9ac775907cc53fb08fcb6776deb7694e30;mm: account pud page tables;yes;no;no 88;MDY6Q29tbWl0MjMyNTI5ODpiNGU5OGQ5YWM3NzU5MDdjYzUzZmIwOGZjYjY3NzZkZWI3Njk0ZTMw;Kirill A. Shutemov;Linus Torvalds;"mm: account pud page tables On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process""). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4e98d9ac775907cc53fb08fcb6776deb7694e30;"On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup";no;yes;yes 88;MDY6Q29tbWl0MjMyNTI5ODpiNGU5OGQ5YWM3NzU5MDdjYzUzZmIwOGZjYjY3NzZkZWI3Njk0ZTMw;Kirill A. Shutemov;Linus Torvalds;"mm: account pud page tables On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process""). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4e98d9ac775907cc53fb08fcb6776deb7694e30; The trick is to allocate a lot of PUD page tables;yes;no;yes 88;MDY6Q29tbWl0MjMyNTI5ODpiNGU5OGQ5YWM3NzU5MDdjYzUzZmIwOGZjYjY3NzZkZWI3Njk0ZTMw;Kirill A. Shutemov;Linus Torvalds;"mm: account pud page tables On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process""). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4e98d9ac775907cc53fb08fcb6776deb7694e30;" We don't account PUD page tables, only PMD and PTE";yes;no;yes 88;MDY6Q29tbWl0MjMyNTI5ODpiNGU5OGQ5YWM3NzU5MDdjYzUzZmIwOGZjYjY3NzZkZWI3Njk0ZTMw;Kirill A. Shutemov;Linus Torvalds;"mm: account pud page tables On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process""). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4e98d9ac775907cc53fb08fcb6776deb7694e30;"We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process"")";no;no;yes 88;MDY6Q29tbWl0MjMyNTI5ODpiNGU5OGQ5YWM3NzU5MDdjYzUzZmIwOGZjYjY3NzZkZWI3Njk0ZTMw;Kirill A. Shutemov;Linus Torvalds;"mm: account pud page tables On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process""). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4e98d9ac775907cc53fb08fcb6776deb7694e30;"Introduction of 5-level paging brings the same issue for PUD page tables";no;no;yes 88;MDY6Q29tbWl0MjMyNTI5ODpiNGU5OGQ5YWM3NzU5MDdjYzUzZmIwOGZjYjY3NzZkZWI3Njk0ZTMw;Kirill A. Shutemov;Linus Torvalds;"mm: account pud page tables On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b (""mm: account pmd page tables to the process""). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4e98d9ac775907cc53fb08fcb6776deb7694e30;The patch expands accounting to PUD level.;yes;yes;no 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;mm, oom_reaper: skip mm structs with mmu notifiers;yes;no;no 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;"Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example";no;yes;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;"tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea";no;yes;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;" ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end";no;no;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;"For KVM mmu_notifier_invalidate_range is a noop and rightfully so";no;no;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;"A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both";no;no;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;"And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM";no;no;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;" For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end()";no;no;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;"And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered";yes;yes;yes 90;MDY6Q29tbWl0MjMyNTI5ODo0ZDRiYmQ4NTI2YThmYmViMmMwOTBlYTM2MDIxMWZjZWZmOTUyMzgz;Michal Hocko;Linus Torvalds;"mm, oom_reaper: skip mm structs with mmu notifiers Andrea has noticed that the oom_reaper doesn't invalidate the range via mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can corrupt the memory of the kvm guest for example. tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not sufficient as per Andrea: ""mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range"" As the callback is allowed to sleep and the implementation is out of hand of the MM it is safer to simply bail out if there is an mmu notifier registered. In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4bbd8526a8fbeb2c090ea360211fceff952383;" In order to not fail too early make the mm_has_notifiers check under the oom_lock and have a little nap before failing to give the current oom victim some more time to exit.";yes;yes;no 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;mm: oom: let oom_reap_task and exit_mmap run concurrently;yes;yes;no 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;"This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0";no;yes;yes 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;"At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill";no;yes;yes 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;"If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted";no;yes;yes 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;"It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED)";yes;no;yes 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;"The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method)";yes;yes;no 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;"If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free)";yes;yes;yes 91;MDY6Q29tbWl0MjMyNTI5ODoyMTI5MjU4MDI0NTQ2NzJlNmNkMjk0OWE3MjdmNWUyYzEzNzdiZjA2;Andrea Arcangeli;Linus Torvalds;"mm: oom: let oom_reap_task and exit_mmap run concurrently This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 (""oom: keep mm of the killed task available"") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: ""Kirill A. Shutemov"" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/212925802454672e6cd2949a727f5e2c1377bf06;"If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables.";yes;no;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;mm, oom: do not rely on TIF_MEMDIE for memory reserves access;yes;no;no 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves";no;no;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" There are few shortcomings of this implementation, though";no;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on";no;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed";no;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"Giving the full access to all these task_structs could lead to a quick memory reserves depletion";no;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g";yes;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader";no;no;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing";yes;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory";yes;yes;no 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely";yes;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves";yes;no;no 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" This makes the access to reserves independent on which task has passed through mark_oom_victim";yes;yes;no 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally";yes;yes;no 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach";yes;yes;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;"There is a demand to make the oom killer memcg aware which will imply many tasks killed at once";yes;no;yes 92;MDY6Q29tbWl0MjMyNTI5ODpjZDA0YWUxZTJkYzhlMzY1MWI4YzQyN2VjMWI5NTAwYzZlZWQ3Yjkw;Michal Hocko;Linus Torvalds;"mm, oom: do not rely on TIF_MEMDIE for memory reserves access For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 (""oom, oom_reaper: allow to reap mm shared by the kthreads"") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90;" This change will allow such a usecase without worrying about complete memory reserves depletion.";yes;yes;no 93;MDY6Q29tbWl0MjMyNTI5ODo0MjI1ODBjM2NlYTdmYWFjYTY3ZjYxOTkzNzViNzk1NjVkM2Q4ZWJk;Roman Gushchin;Linus Torvalds;"mm/oom_kill.c: add tracepoints for oom reaper-related events During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo ""oom:mark_victim"" > set_event $ echo ""oom:wake_reaper"" >> set_event $ echo ""oom:skip_task_reaping"" >> set_event $ echo ""oom:start_task_reaping"" >> set_event $ echo ""oom:finish_task_reaping"" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/422580c3cea7faaca67f6199375b79565d3d8ebd;mm/oom_kill.c: add tracepoints for oom reaper-related events;yes;yes;no 93;MDY6Q29tbWl0MjMyNTI5ODo0MjI1ODBjM2NlYTdmYWFjYTY3ZjYxOTkzNzViNzk1NjVkM2Q4ZWJk;Roman Gushchin;Linus Torvalds;"mm/oom_kill.c: add tracepoints for oom reaper-related events During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo ""oom:mark_victim"" > set_event $ echo ""oom:wake_reaper"" >> set_event $ echo ""oom:skip_task_reaping"" >> set_event $ echo ""oom:start_task_reaping"" >> set_event $ echo ""oom:finish_task_reaping"" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/422580c3cea7faaca67f6199375b79565d3d8ebd;"During the debugging of the problem described in and fixed by Tetsuo Handa in , I've found that the existing debug output is not really useful to understand issues related to the oom reaper";no;yes;yes 93;MDY6Q29tbWl0MjMyNTI5ODo0MjI1ODBjM2NlYTdmYWFjYTY3ZjYxOTkzNzViNzk1NjVkM2Q4ZWJk;Roman Gushchin;Linus Torvalds;"mm/oom_kill.c: add tracepoints for oom reaper-related events During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo ""oom:mark_victim"" > set_event $ echo ""oom:wake_reaper"" >> set_event $ echo ""oom:skip_task_reaping"" >> set_event $ echo ""oom:start_task_reaping"" >> set_event $ echo ""oom:finish_task_reaping"" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/422580c3cea7faaca67f6199375b79565d3d8ebd;"So, I assume, that adding some tracepoints might help with debugging of similar issues";yes;yes;no 93;MDY6Q29tbWl0MjMyNTI5ODo0MjI1ODBjM2NlYTdmYWFjYTY3ZjYxOTkzNzViNzk1NjVkM2Q4ZWJk;Roman Gushchin;Linus Torvalds;"mm/oom_kill.c: add tracepoints for oom reaper-related events During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo ""oom:mark_victim"" > set_event $ echo ""oom:wake_reaper"" >> set_event $ echo ""oom:skip_task_reaping"" >> set_event $ echo ""oom:start_task_reaping"" >> set_event $ echo ""oom:finish_task_reaping"" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/422580c3cea7faaca67f6199375b79565d3d8ebd;Trace the following events;yes;no;no 93;MDY6Q29tbWl0MjMyNTI5ODo0MjI1ODBjM2NlYTdmYWFjYTY3ZjYxOTkzNzViNzk1NjVkM2Q4ZWJk;Roman Gushchin;Linus Torvalds;"mm/oom_kill.c: add tracepoints for oom reaper-related events During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo ""oom:mark_victim"" > set_event $ echo ""oom:wake_reaper"" >> set_event $ echo ""oom:skip_task_reaping"" >> set_event $ echo ""oom:start_task_reaping"" >> set_event $ echo ""oom:finish_task_reaping"" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/422580c3cea7faaca67f6199375b79565d3d8ebd;" 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping";yes;no;yes 93;MDY6Q29tbWl0MjMyNTI5ODo0MjI1ODBjM2NlYTdmYWFjYTY3ZjYxOTkzNzViNzk1NjVkM2Q4ZWJk;Roman Gushchin;Linus Torvalds;"mm/oom_kill.c: add tracepoints for oom reaper-related events During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo ""oom:mark_victim"" > set_event $ echo ""oom:wake_reaper"" >> set_event $ echo ""oom:skip_task_reaping"" >> set_event $ echo ""oom:start_task_reaping"" >> set_event $ echo ""oom:finish_task_reaping"" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/422580c3cea7faaca67f6199375b79565d3d8ebd;"How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list";yes;yes;yes 94;MDY6Q29tbWl0MjMyNTI5ODo4ZTY3NWY3YWY1MDc0N2UxZTllOTY1MzhlODcwNjc2N2U0ZjgwZTJj;Konstantin Khlebnikov;Linus Torvalds;"mm/oom_kill: count global and memory cgroup oom kills Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob ""memory.events"" (in memory.oom_control for v1 cgroup). Also describe difference between ""oom"" and ""oom_kill"" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8e675f7af50747e1e9e96538e8706767e4f80e2c;mm/oom_kill: count global and memory cgroup oom kills;yes;no;no 94;MDY6Q29tbWl0MjMyNTI5ODo4ZTY3NWY3YWY1MDc0N2UxZTllOTY1MzhlODcwNjc2N2U0ZjgwZTJj;Konstantin Khlebnikov;Linus Torvalds;"mm/oom_kill: count global and memory cgroup oom kills Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob ""memory.events"" (in memory.oom_control for v1 cgroup). Also describe difference between ""oom"" and ""oom_kill"" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8e675f7af50747e1e9e96538e8706767e4f80e2c;"Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob ""memory.events"" (in memory.oom_control for v1 cgroup)";yes;no;no 94;MDY6Q29tbWl0MjMyNTI5ODo4ZTY3NWY3YWY1MDc0N2UxZTllOTY1MzhlODcwNjc2N2U0ZjgwZTJj;Konstantin Khlebnikov;Linus Torvalds;"mm/oom_kill: count global and memory cgroup oom kills Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob ""memory.events"" (in memory.oom_control for v1 cgroup). Also describe difference between ""oom"" and ""oom_kill"" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8e675f7af50747e1e9e96538e8706767e4f80e2c;"Also describe difference between ""oom"" and ""oom_kill"" in memory cgroup documentation";yes;yes;no 94;MDY6Q29tbWl0MjMyNTI5ODo4ZTY3NWY3YWY1MDc0N2UxZTllOTY1MzhlODcwNjc2N2U0ZjgwZTJj;Konstantin Khlebnikov;Linus Torvalds;"mm/oom_kill: count global and memory cgroup oom kills Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob ""memory.events"" (in memory.oom_control for v1 cgroup). Also describe difference between ""oom"" and ""oom_kill"" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8e675f7af50747e1e9e96538e8706767e4f80e2c;" Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault";no;no;yes 94;MDY6Q29tbWl0MjMyNTI5ODo4ZTY3NWY3YWY1MDc0N2UxZTllOTY1MzhlODcwNjc2N2U0ZjgwZTJj;Konstantin Khlebnikov;Linus Torvalds;"mm/oom_kill: count global and memory cgroup oom kills Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob ""memory.events"" (in memory.oom_control for v1 cgroup). Also describe difference between ""oom"" and ""oom_kill"" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8e675f7af50747e1e9e96538e8706767e4f80e2c;These counters helps in monitoring oom kills - for now the only way is;no;yes;yes 95;MDY6Q29tbWl0MjMyNTI5ODpkNzVkYTAwNGM3MDhjOWZjYTdiNTNmN2RhMjkzYTI5NTUyMjQxNGQ5;Michal Hocko;Linus Torvalds;"oom: improve oom disable handling Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected: sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback. This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks. In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}. This information is useful on its own because it might help debugging potential memory allocation failures. Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d75da004c708c9fca7b53f7da293a295522414d9;oom: improve oom disable handling;yes;yes;no 95;MDY6Q29tbWl0MjMyNTI5ODpkNzVkYTAwNGM3MDhjOWZjYTdiNTNmN2RhMjkzYTI5NTUyMjQxNGQ5;Michal Hocko;Linus Torvalds;"oom: improve oom disable handling Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected: sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback. This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks. In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}. This information is useful on its own because it might help debugging potential memory allocation failures. Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d75da004c708c9fca7b53f7da293a295522414d9;"Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected";no;yes;yes 95;MDY6Q29tbWl0MjMyNTI5ODpkNzVkYTAwNGM3MDhjOWZjYTdiNTNmN2RhMjkzYTI5NTUyMjQxNGQ5;Michal Hocko;Linus Torvalds;"oom: improve oom disable handling Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected: sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback. This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks. In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}. This information is useful on its own because it might help debugging potential memory allocation failures. Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d75da004c708c9fca7b53f7da293a295522414d9;" sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback";no;yes;yes 95;MDY6Q29tbWl0MjMyNTI5ODpkNzVkYTAwNGM3MDhjOWZjYTdiNTNmN2RhMjkzYTI5NTUyMjQxNGQ5;Michal Hocko;Linus Torvalds;"oom: improve oom disable handling Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected: sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback. This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks. In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}. This information is useful on its own because it might help debugging potential memory allocation failures. Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d75da004c708c9fca7b53f7da293a295522414d9;"This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks";yes;no;no 95;MDY6Q29tbWl0MjMyNTI5ODpkNzVkYTAwNGM3MDhjOWZjYTdiNTNmN2RhMjkzYTI5NTUyMjQxNGQ5;Michal Hocko;Linus Torvalds;"oom: improve oom disable handling Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected: sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback. This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks. In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}. This information is useful on its own because it might help debugging potential memory allocation failures. Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d75da004c708c9fca7b53f7da293a295522414d9;" In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}";yes;yes;no 95;MDY6Q29tbWl0MjMyNTI5ODpkNzVkYTAwNGM3MDhjOWZjYTdiNTNmN2RhMjkzYTI5NTUyMjQxNGQ5;Michal Hocko;Linus Torvalds;"oom: improve oom disable handling Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected: sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback. This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks. In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}. This information is useful on its own because it might help debugging potential memory allocation failures. Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d75da004c708c9fca7b53f7da293a295522414d9;" This information is useful on its own because it might help debugging potential memory allocation failures";no;yes;no 95;MDY6Q29tbWl0MjMyNTI5ODpkNzVkYTAwNGM3MDhjOWZjYTdiNTNmN2RhMjkzYTI5NTUyMjQxNGQ5;Michal Hocko;Linus Torvalds;"oom: improve oom disable handling Tetsuo has reported that sysrq triggered OOM killer will print a misleading information when no tasks are selected: sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled sysrq: SysRq : Manual OOM execution sysrq: OOM request ignored because killer is disabled The real reason is that there are no eligible tasks for the OOM killer to select but since commit 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") the semantic of out_of_memory has changed without updating moom_callback. This patch updates moom_callback to tell that no task was eligible which is the case for both oom killer disabled and no eligible tasks. In order to help distinguish first case from the second add printk to both oom_killer_{enable,disable}. This information is useful on its own because it might help debugging potential memory allocation failures. Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"") Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d75da004c708c9fca7b53f7da293a295522414d9;"Fixes: 7c5f64f84483 (""mm: oom: deduplicate victim selection code for memcg and global oom"")";no;yes;no 97;MDY6Q29tbWl0MjMyNTI5ODpmN2NjYmFlNDVjNWUyYzEwNzc2NTRiMGU4NTdlN2VmYjFhYTMxYzky;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h> We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/coredump.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/f7ccbae45c5e2c1077654b0e857e7efb1aa31c92;sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h>;yes;yes;no 97;MDY6Q29tbWl0MjMyNTI5ODpmN2NjYmFlNDVjNWUyYzEwNzc2NTRiMGU4NTdlN2VmYjFhYTMxYzky;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h> We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/coredump.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/f7ccbae45c5e2c1077654b0e857e7efb1aa31c92;"We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files";no;yes;yes 97;MDY6Q29tbWl0MjMyNTI5ODpmN2NjYmFlNDVjNWUyYzEwNzc2NTRiMGU4NTdlN2VmYjFhYTMxYzky;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h> We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/coredump.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/f7ccbae45c5e2c1077654b0e857e7efb1aa31c92;"Create a trivial placeholder <linux/sched/coredump.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable";yes;yes;no 97;MDY6Q29tbWl0MjMyNTI5ODpmN2NjYmFlNDVjNWUyYzEwNzc2NTRiMGU4NTdlN2VmYjFhYTMxYzky;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h> We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/coredump.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/f7ccbae45c5e2c1077654b0e857e7efb1aa31c92;Include the new header in the files that are going to need it.;yes;yes;no 98;MDY6Q29tbWl0MjMyNTI5ODo2ZTg0ZjMxNTIyZjkzMTAyN2JmNjk1NzUyMDg3ZWNlMjc4YzEwZDNm;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e84f31522f931027bf695752087ece278c10d3f;sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>;yes;yes;no 98;MDY6Q29tbWl0MjMyNTI5ODo2ZTg0ZjMxNTIyZjkzMTAyN2JmNjk1NzUyMDg3ZWNlMjc4YzEwZDNm;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e84f31522f931027bf695752087ece278c10d3f;"We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files";no;yes;yes 98;MDY6Q29tbWl0MjMyNTI5ODo2ZTg0ZjMxNTIyZjkzMTAyN2JmNjk1NzUyMDg3ZWNlMjc4YzEwZDNm;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e84f31522f931027bf695752087ece278c10d3f;"Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable";yes;yes;no 98;MDY6Q29tbWl0MjMyNTI5ODo2ZTg0ZjMxNTIyZjkzMTAyN2JmNjk1NzUyMDg3ZWNlMjc4YzEwZDNm;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e84f31522f931027bf695752087ece278c10d3f;The APIs that are going to be moved first are;yes;no;yes 98;MDY6Q29tbWl0MjMyNTI5ODo2ZTg0ZjMxNTIyZjkzMTAyN2JmNjk1NzUyMDg3ZWNlMjc4YzEwZDNm;Ingo Molnar;Ingo Molnar;"sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e84f31522f931027bf695752087ece278c10d3f;" mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it.";yes;yes;yes 99;MDY6Q29tbWl0MjMyNTI5ODpmMWYxMDA3NjQ0ZmZjODA1MWE0YzExNDI3ZDU4YjE5NjdhZTdiNzVh;Vegard Nossum;Linus Torvalds;"mm: add new mmgrab() helper Apart from adding the helper function itself, the rest of the kernel is converted mechanically using: git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/' git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/' This is needed for a later patch that hooks into the helper, but might be a worthwhile cleanup on its own. (Michal Hocko provided most of the kerneldoc comment.) Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f1f1007644ffc8051a4c11427d58b1967ae7b75a;mm: add new mmgrab() helper;yes;no;no 99;MDY6Q29tbWl0MjMyNTI5ODpmMWYxMDA3NjQ0ZmZjODA1MWE0YzExNDI3ZDU4YjE5NjdhZTdiNzVh;Vegard Nossum;Linus Torvalds;"mm: add new mmgrab() helper Apart from adding the helper function itself, the rest of the kernel is converted mechanically using: git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/' git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/' This is needed for a later patch that hooks into the helper, but might be a worthwhile cleanup on its own. (Michal Hocko provided most of the kerneldoc comment.) Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f1f1007644ffc8051a4c11427d58b1967ae7b75a;"Apart from adding the helper function itself, the rest of the kernel is converted mechanically using";yes;no;no 99;MDY6Q29tbWl0MjMyNTI5ODpmMWYxMDA3NjQ0ZmZjODA1MWE0YzExNDI3ZDU4YjE5NjdhZTdiNzVh;Vegard Nossum;Linus Torvalds;"mm: add new mmgrab() helper Apart from adding the helper function itself, the rest of the kernel is converted mechanically using: git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/' git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/' This is needed for a later patch that hooks into the helper, but might be a worthwhile cleanup on its own. (Michal Hocko provided most of the kerneldoc comment.) Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f1f1007644ffc8051a4c11427d58b1967ae7b75a;"This is needed for a later patch that hooks into the helper, but might be a worthwhile cleanup on its own";no;yes;no 99;MDY6Q29tbWl0MjMyNTI5ODpmMWYxMDA3NjQ0ZmZjODA1MWE0YzExNDI3ZDU4YjE5NjdhZTdiNzVh;Vegard Nossum;Linus Torvalds;"mm: add new mmgrab() helper Apart from adding the helper function itself, the rest of the kernel is converted mechanically using: git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/' git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/' This is needed for a later patch that hooks into the helper, but might be a worthwhile cleanup on its own. (Michal Hocko provided most of the kerneldoc comment.) Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f1f1007644ffc8051a4c11427d58b1967ae7b75a;(Michal Hocko provided most of the kerneldoc comment.);no;no;yes 100;MDY6Q29tbWl0MjMyNTI5ODoyOTljNTE3YWRiOTAzZGRkMzc2ZGQxMDNmY2M5ZGNmZmYzZDQ3Mjhi;David Rientjes;Linus Torvalds;"mm, oom: header nodemask is NULL when cpusets are disabled Commit 82e7d3abec86 (""oom: print nodemask in the oom report"") implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy. cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by dump_header(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator. This provides better debugging output since cpuset_print_current_mems_allowed() is already provided. [rientjes@google.com: newline per Hillf] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701200158300.88321@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701191454470.2381@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/299c517adb903ddd376dd103fcc9dcfff3d4728b;mm, oom: header nodemask is NULL when cpusets are disabled;no;no;yes 100;MDY6Q29tbWl0MjMyNTI5ODoyOTljNTE3YWRiOTAzZGRkMzc2ZGQxMDNmY2M5ZGNmZmYzZDQ3Mjhi;David Rientjes;Linus Torvalds;"mm, oom: header nodemask is NULL when cpusets are disabled Commit 82e7d3abec86 (""oom: print nodemask in the oom report"") implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy. cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by dump_header(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator. This provides better debugging output since cpuset_print_current_mems_allowed() is already provided. [rientjes@google.com: newline per Hillf] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701200158300.88321@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701191454470.2381@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/299c517adb903ddd376dd103fcc9dcfff3d4728b;"Commit 82e7d3abec86 (""oom: print nodemask in the oom report"") implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy";no;no;yes 100;MDY6Q29tbWl0MjMyNTI5ODoyOTljNTE3YWRiOTAzZGRkMzc2ZGQxMDNmY2M5ZGNmZmYzZDQ3Mjhi;David Rientjes;Linus Torvalds;"mm, oom: header nodemask is NULL when cpusets are disabled Commit 82e7d3abec86 (""oom: print nodemask in the oom report"") implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy. cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by dump_header(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator. This provides better debugging output since cpuset_print_current_mems_allowed() is already provided. [rientjes@google.com: newline per Hillf] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701200158300.88321@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701191454470.2381@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/299c517adb903ddd376dd103fcc9dcfff3d4728b;" cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by dump_header(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator";no;yes;yes 100;MDY6Q29tbWl0MjMyNTI5ODoyOTljNTE3YWRiOTAzZGRkMzc2ZGQxMDNmY2M5ZGNmZmYzZDQ3Mjhi;David Rientjes;Linus Torvalds;"mm, oom: header nodemask is NULL when cpusets are disabled Commit 82e7d3abec86 (""oom: print nodemask in the oom report"") implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy. cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by dump_header(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator. This provides better debugging output since cpuset_print_current_mems_allowed() is already provided. [rientjes@google.com: newline per Hillf] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701200158300.88321@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701191454470.2381@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/299c517adb903ddd376dd103fcc9dcfff3d4728b;"This provides better debugging output since cpuset_print_current_mems_allowed() is already provided.";no;yes;yes 101;MDY6Q29tbWl0MjMyNTI5ODoyMzUxOTA3MzhhYmE3YzVjOTQzMDBjOGQ4ODI4NDJhNTM1MjgwZTVh;Kirill A. Shutemov;Linus Torvalds;"oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA Logic on whether we can reap pages from the VMA should match what we have in madvise_dontneed(). In particular, we should skip, VM_PFNMAP VMAs, but we don't now. Let's just extract condition on which we can shoot down pagesi from a VMA with MADV_DONTNEED into separate function and use it in both places. Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/235190738aba7c5c94300c8d882842a535280e5a;oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA;yes;yes;no 101;MDY6Q29tbWl0MjMyNTI5ODoyMzUxOTA3MzhhYmE3YzVjOTQzMDBjOGQ4ODI4NDJhNTM1MjgwZTVh;Kirill A. Shutemov;Linus Torvalds;"oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA Logic on whether we can reap pages from the VMA should match what we have in madvise_dontneed(). In particular, we should skip, VM_PFNMAP VMAs, but we don't now. Let's just extract condition on which we can shoot down pagesi from a VMA with MADV_DONTNEED into separate function and use it in both places. Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/235190738aba7c5c94300c8d882842a535280e5a;"Logic on whether we can reap pages from the VMA should match what we have in madvise_dontneed()";no;yes;yes 101;MDY6Q29tbWl0MjMyNTI5ODoyMzUxOTA3MzhhYmE3YzVjOTQzMDBjOGQ4ODI4NDJhNTM1MjgwZTVh;Kirill A. Shutemov;Linus Torvalds;"oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA Logic on whether we can reap pages from the VMA should match what we have in madvise_dontneed(). In particular, we should skip, VM_PFNMAP VMAs, but we don't now. Let's just extract condition on which we can shoot down pagesi from a VMA with MADV_DONTNEED into separate function and use it in both places. Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/235190738aba7c5c94300c8d882842a535280e5a;" In particular, we should skip, VM_PFNMAP VMAs, but we don't now";no;yes;yes 101;MDY6Q29tbWl0MjMyNTI5ODoyMzUxOTA3MzhhYmE3YzVjOTQzMDBjOGQ4ODI4NDJhNTM1MjgwZTVh;Kirill A. Shutemov;Linus Torvalds;"oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA Logic on whether we can reap pages from the VMA should match what we have in madvise_dontneed(). In particular, we should skip, VM_PFNMAP VMAs, but we don't now. Let's just extract condition on which we can shoot down pagesi from a VMA with MADV_DONTNEED into separate function and use it in both places. Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/235190738aba7c5c94300c8d882842a535280e5a;"Let's just extract condition on which we can shoot down pagesi from a VMA with MADV_DONTNEED into separate function and use it in both places.";yes;yes;no 102;MDY6Q29tbWl0MjMyNTI5ODozZTg3MTVmZGMwM2U4ZGY0ZDI2ZDhlNDM2MTY2ZTQ0ZTNlNDE2ZDNi;Kirill A. Shutemov;Linus Torvalds;"mm: drop zap_details::check_swap_entries detail == NULL would give the same functionality as .check_swap_entries==true. Link: http://lkml.kernel.org/r/20170118122429.43661-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3e8715fdc03e8df4d26d8e436166e44e3e416d3b;mm: drop zap_details::check_swap_entries;yes;no;no 102;MDY6Q29tbWl0MjMyNTI5ODozZTg3MTVmZGMwM2U4ZGY0ZDI2ZDhlNDM2MTY2ZTQ0ZTNlNDE2ZDNi;Kirill A. Shutemov;Linus Torvalds;"mm: drop zap_details::check_swap_entries detail == NULL would give the same functionality as .check_swap_entries==true. Link: http://lkml.kernel.org/r/20170118122429.43661-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3e8715fdc03e8df4d26d8e436166e44e3e416d3b;"detail == NULL would give the same functionality as .check_swap_entries==true.";yes;yes;yes 103;MDY6Q29tbWl0MjMyNTI5ODpkYTE2MmU5MzY4OTkwZWQ3NDcwNzVlMmFiNDI3ZGEwNzU5ZmM0YTU5;Kirill A. Shutemov;Linus Torvalds;"mm: drop zap_details::ignore_dirty The only user of ignore_dirty is oom-reaper. But it doesn't really use it. ignore_dirty only has effect on file pages mapped with dirty pte. But oom-repear skips shared VMAs, so there's no way we can dirty file pte in them. Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da162e9368990ed747075e2ab427da0759fc4a59;mm: drop zap_details::ignore_dirty;yes;no;no 103;MDY6Q29tbWl0MjMyNTI5ODpkYTE2MmU5MzY4OTkwZWQ3NDcwNzVlMmFiNDI3ZGEwNzU5ZmM0YTU5;Kirill A. Shutemov;Linus Torvalds;"mm: drop zap_details::ignore_dirty The only user of ignore_dirty is oom-reaper. But it doesn't really use it. ignore_dirty only has effect on file pages mapped with dirty pte. But oom-repear skips shared VMAs, so there's no way we can dirty file pte in them. Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da162e9368990ed747075e2ab427da0759fc4a59;The only user of ignore_dirty is oom-reaper;no;no;yes 103;MDY6Q29tbWl0MjMyNTI5ODpkYTE2MmU5MzY4OTkwZWQ3NDcwNzVlMmFiNDI3ZGEwNzU5ZmM0YTU5;Kirill A. Shutemov;Linus Torvalds;"mm: drop zap_details::ignore_dirty The only user of ignore_dirty is oom-reaper. But it doesn't really use it. ignore_dirty only has effect on file pages mapped with dirty pte. But oom-repear skips shared VMAs, so there's no way we can dirty file pte in them. Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da162e9368990ed747075e2ab427da0759fc4a59;" But it doesn't really use ignore_dirty only has effect on file pages mapped with dirty pte";no;yes;yes 103;MDY6Q29tbWl0MjMyNTI5ODpkYTE2MmU5MzY4OTkwZWQ3NDcwNzVlMmFiNDI3ZGEwNzU5ZmM0YTU5;Kirill A. Shutemov;Linus Torvalds;"mm: drop zap_details::ignore_dirty The only user of ignore_dirty is oom-reaper. But it doesn't really use it. ignore_dirty only has effect on file pages mapped with dirty pte. But oom-repear skips shared VMAs, so there's no way we can dirty file pte in them. Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da162e9368990ed747075e2ab427da0759fc4a59;" But oom-repear skips shared VMAs, so there's no way we can dirty file pte in them.";no;yes;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically;yes;no;no 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;"__alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request";no;no;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" This includes lowmem requests, costly high order requests and others";no;no;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" For a long time __GFP_NOFAIL acted as an override for all those rules";no;no;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" This is not documented and it can be quite surprising as well";no;yes;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious";no;yes;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation";no;yes;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;"The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL"")";no;no;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever";no;no;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases";yes;yes;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;"- it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences";no;yes;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;" It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems";yes;yes;yes 104;MDY6Q29tbWl0MjMyNTI5ODowNmFkMjc2YWMxODc0MmM2YjI4MTY5OGQ0MWIyN2EyOTBjZDQyNDA3;Michal Hocko;Linus Torvalds;"mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a937f12 (""oom: invoke oom killer for __GFP_NOFAIL""). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/06ad276ac18742c6b281698d41b27a290cd42407;"This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects.";yes;yes;no 105;MDY6Q29tbWl0MjMyNTI5ODo5YWY3NDRkNzQzMTcwYjVmNWVmNzAwMzFkZWE4ZDc3MmQxNjZhYjI4;Michal Hocko;Linus Torvalds;"lib/show_mem.c: teach show_mem to work with the given nodemask show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9af744d743170b5f5ef70031dea8d772d166ab28;lib/show_mem.c: teach show_mem to work with the given nodemask;yes;no;no 105;MDY6Q29tbWl0MjMyNTI5ODo5YWY3NDRkNzQzMTcwYjVmNWVmNzAwMzFkZWE4ZDc3MmQxNjZhYjI4;Michal Hocko;Linus Torvalds;"lib/show_mem.c: teach show_mem to work with the given nodemask show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9af744d743170b5f5ef70031dea8d772d166ab28;"show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES";no;no;yes 105;MDY6Q29tbWl0MjMyNTI5ODo5YWY3NDRkNzQzMTcwYjVmNWVmNzAwMzFkZWE4ZDc3MmQxNjZhYjI4;Michal Hocko;Linus Torvalds;"lib/show_mem.c: teach show_mem to work with the given nodemask show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9af744d743170b5f5ef70031dea8d772d166ab28;" The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process";no;no;yes 105;MDY6Q29tbWl0MjMyNTI5ODo5YWY3NDRkNzQzMTcwYjVmNWVmNzAwMzFkZWE4ZDc3MmQxNjZhYjI4;Michal Hocko;Linus Torvalds;"lib/show_mem.c: teach show_mem to work with the given nodemask show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9af744d743170b5f5ef70031dea8d772d166ab28;" This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions";no;yes;yes 105;MDY6Q29tbWl0MjMyNTI5ODo5YWY3NDRkNzQzMTcwYjVmNWVmNzAwMzFkZWE4ZDc3MmQxNjZhYjI4;Michal Hocko;Linus Torvalds;"lib/show_mem.c: teach show_mem to work with the given nodemask show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9af744d743170b5f5ef70031dea8d772d166ab28;" memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page)";no;no;yes 105;MDY6Q29tbWl0MjMyNTI5ODo5YWY3NDRkNzQzMTcwYjVmNWVmNzAwMzFkZWE4ZDc3MmQxNjZhYjI4;Michal Hocko;Linus Torvalds;"lib/show_mem.c: teach show_mem to work with the given nodemask show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9af744d743170b5f5ef70031dea8d772d166ab28;"Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed";yes;yes;no 105;MDY6Q29tbWl0MjMyNTI5ODo5YWY3NDRkNzQzMTcwYjVmNWVmNzAwMzFkZWE4ZDc3MmQxNjZhYjI4;Michal Hocko;Linus Torvalds;"lib/show_mem.c: teach show_mem to work with the given nodemask show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9af744d743170b5f5ef70031dea8d772d166ab28;" NULL nodemask is interpreted as cpuset_current_mems_allowed.";yes;no;no 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;oom: print nodemask in the oom report;yes;no;no 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;We have received a hard to explain oom report from a customer;no;no;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" The oom triggered regardless there is a lot of free memory";no;no;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7";no;no;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request";no;no;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;Although this is highly unlikely it cannot be ruled out;no;yes;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;"A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0";no;no;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" We cannot see that information from the report though";no;yes;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" mems_allowed turned out to be more confusing than really helpful";no;yes;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;Fix this by always priting the nodemask;yes;yes;no 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" It is either mempolicy mask (and non-null) or the one defined by the cpusets";yes;no;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g";yes;yes;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" when they turn out to be too restrictive)";yes;yes;yes 106;MDY6Q29tbWl0MjMyNTI5ODo4MmU3ZDNhYmVjODZjYmE5ZGY5NDVhNzY1YmJhMzg0ZjhhYzExM2E3;Michal Hocko;Linus Torvalds;"oom: print nodemask in the oom report We have received a hard to explain oom report from a customer. The oom triggered regardless there is a lot of free memory: PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0 PoolThread cpuset=/ mems_allowed=0-7 Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1 Call Trace: dump_trace+0x75/0x300 dump_stack+0x69/0x6f dump_header+0x8e/0x110 oom_kill_process+0xa6/0x350 out_of_memory+0x2b7/0x310 __alloc_pages_slowpath+0x7dd/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_anonymous_page+0x13e/0x300 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 [...] active_anon:1135959151 inactive_anon:1051962 isolated_anon:0 active_file:13093 inactive_file:222506 isolated_file:0 unevictable:262144 dirty:2 writeback:0 unstable:0 free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308 mapped:261139 shmem:166297 pagetables:2228282 bounce:0 [...] Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 2892 775542 775542 Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 772650 772650 Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 So all nodes but Node 0 have a lot of free memory which should suggest that there is an available memory especially when mems_allowed=0-7. One could speculate that a massive process has managed to terminate and free up a lot of memory while racing with the above allocation request. Although this is highly unlikely it cannot be ruled out. A further debugging, however shown that the faulting process had mempolicy (not cpuset) to bind to Node 0. We cannot see that information from the report though. mems_allowed turned out to be more confusing than really helpful. Fix this by always priting the nodemask. It is either mempolicy mask (and non-null) or the one defined by the cpusets. The new output for the above oom report would be PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0 This patch doesn't touch show_mem and the node filtering based on the cpuset node mask because mempolicy is always a subset of cpusets and seeing the full cpuset oom context might be helpful for tunning more specific mempolicies inside cpusets (e.g. when they turn out to be too restrictive). To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed). Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Sellami Abdelkader <abdelkader.sellami@sap.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/82e7d3abec86cba9df945a765bba384f8ac113a7;" To prevent from ugly ifdefs the mask is printed even for !NUMA configurations but this should be OK (a single node will be printed).";yes;yes;no 107;MDY6Q29tbWl0MjMyNTI5ODphMTA0ODA4ZTIxMmE5ZWU5N2U2YjljYjY5NDUxODVlNTA5MDVmMDA5;Tetsuo Handa;Linus Torvalds;"mm: don't emit warning from pagefault_out_of_memory() Commit c32b3cbe0d06 (""oom, PM: make OOM detection in the freezer path raceless"") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer. Now, patch ""oom, suspend: fix oom_killer_disable vs. pm suspend properly"" introduced a timeout for oom_killer_disable(). Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved. Therefore, we no longer need to warn when we raced with disabling the OOM killer. Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a104808e212a9ee97e6b9cb6945185e50905f009;mm: don't emit warning from pagefault_out_of_memory();yes;no;no 107;MDY6Q29tbWl0MjMyNTI5ODphMTA0ODA4ZTIxMmE5ZWU5N2U2YjljYjY5NDUxODVlNTA5MDVmMDA5;Tetsuo Handa;Linus Torvalds;"mm: don't emit warning from pagefault_out_of_memory() Commit c32b3cbe0d06 (""oom, PM: make OOM detection in the freezer path raceless"") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer. Now, patch ""oom, suspend: fix oom_killer_disable vs. pm suspend properly"" introduced a timeout for oom_killer_disable(). Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved. Therefore, we no longer need to warn when we raced with disabling the OOM killer. Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a104808e212a9ee97e6b9cb6945185e50905f009;"Commit c32b3cbe0d06 (""oom, PM: make OOM detection in the freezer path raceless"") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer";no;no;yes 107;MDY6Q29tbWl0MjMyNTI5ODphMTA0ODA4ZTIxMmE5ZWU5N2U2YjljYjY5NDUxODVlNTA5MDVmMDA5;Tetsuo Handa;Linus Torvalds;"mm: don't emit warning from pagefault_out_of_memory() Commit c32b3cbe0d06 (""oom, PM: make OOM detection in the freezer path raceless"") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer. Now, patch ""oom, suspend: fix oom_killer_disable vs. pm suspend properly"" introduced a timeout for oom_killer_disable(). Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved. Therefore, we no longer need to warn when we raced with disabling the OOM killer. Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a104808e212a9ee97e6b9cb6945185e50905f009;"Now, patch ""oom, suspend: fix oom_killer_disable vs";yes;yes;yes 107;MDY6Q29tbWl0MjMyNTI5ODphMTA0ODA4ZTIxMmE5ZWU5N2U2YjljYjY5NDUxODVlNTA5MDVmMDA5;Tetsuo Handa;Linus Torvalds;"mm: don't emit warning from pagefault_out_of_memory() Commit c32b3cbe0d06 (""oom, PM: make OOM detection in the freezer path raceless"") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer. Now, patch ""oom, suspend: fix oom_killer_disable vs. pm suspend properly"" introduced a timeout for oom_killer_disable(). Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved. Therefore, we no longer need to warn when we raced with disabling the OOM killer. Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a104808e212a9ee97e6b9cb6945185e50905f009;" pm suspend properly"" introduced a timeout for oom_killer_disable()";no;no;yes 107;MDY6Q29tbWl0MjMyNTI5ODphMTA0ODA4ZTIxMmE5ZWU5N2U2YjljYjY5NDUxODVlNTA5MDVmMDA5;Tetsuo Handa;Linus Torvalds;"mm: don't emit warning from pagefault_out_of_memory() Commit c32b3cbe0d06 (""oom, PM: make OOM detection in the freezer path raceless"") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer. Now, patch ""oom, suspend: fix oom_killer_disable vs. pm suspend properly"" introduced a timeout for oom_killer_disable(). Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved. Therefore, we no longer need to warn when we raced with disabling the OOM killer. Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a104808e212a9ee97e6b9cb6945185e50905f009;" Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved";no;yes;yes 107;MDY6Q29tbWl0MjMyNTI5ODphMTA0ODA4ZTIxMmE5ZWU5N2U2YjljYjY5NDUxODVlNTA5MDVmMDA5;Tetsuo Handa;Linus Torvalds;"mm: don't emit warning from pagefault_out_of_memory() Commit c32b3cbe0d06 (""oom, PM: make OOM detection in the freezer path raceless"") inserted a WARN_ON() into pagefault_out_of_memory() in order to warn when we raced with disabling the OOM killer. Now, patch ""oom, suspend: fix oom_killer_disable vs. pm suspend properly"" introduced a timeout for oom_killer_disable(). Even if we raced with disabling the OOM killer and the system is OOM livelocked, the OOM killer will be enabled eventually (in 20 seconds by default) and the OOM livelock will be solved. Therefore, we no longer need to warn when we raced with disabling the OOM killer. Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a104808e212a9ee97e6b9cb6945185e50905f009;" Therefore, we no longer need to warn when we raced with disabling the OOM killer.";yes;yes;no 108;MDY6Q29tbWl0MjMyNTI5ODo5MjU0OTkwZmI5ZjBmMTVmMjU2MDU3NDhkYTIwY2ZiZWNlZDdjODE2;Michal Hocko;Linus Torvalds;"oom: warn if we go OOM for higher order and compaction is disabled Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least. Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine. We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense. Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing). This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled. This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason. Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9254990fb9f0f15f25605748da20cfbeced7c816;oom: warn if we go OOM for higher order and compaction is disabled;yes;no;no 108;MDY6Q29tbWl0MjMyNTI5ODo5MjU0OTkwZmI5ZjBmMTVmMjU2MDU3NDhkYTIwY2ZiZWNlZDdjODE2;Michal Hocko;Linus Torvalds;"oom: warn if we go OOM for higher order and compaction is disabled Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least. Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine. We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense. Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing). This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled. This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason. Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9254990fb9f0f15f25605748da20cfbeced7c816;"Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least";no;yes;yes 108;MDY6Q29tbWl0MjMyNTI5ODo5MjU0OTkwZmI5ZjBmMTVmMjU2MDU3NDhkYTIwY2ZiZWNlZDdjODE2;Michal Hocko;Linus Torvalds;"oom: warn if we go OOM for higher order and compaction is disabled Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least. Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine. We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense. Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing). This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled. This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason. Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9254990fb9f0f15f25605748da20cfbeced7c816;" Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine";no;yes;yes 108;MDY6Q29tbWl0MjMyNTI5ODo5MjU0OTkwZmI5ZjBmMTVmMjU2MDU3NDhkYTIwY2ZiZWNlZDdjODE2;Michal Hocko;Linus Torvalds;"oom: warn if we go OOM for higher order and compaction is disabled Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least. Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine. We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense. Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing). This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled. This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason. Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9254990fb9f0f15f25605748da20cfbeced7c816;"We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense";yes;yes;yes 108;MDY6Q29tbWl0MjMyNTI5ODo5MjU0OTkwZmI5ZjBmMTVmMjU2MDU3NDhkYTIwY2ZiZWNlZDdjODE2;Michal Hocko;Linus Torvalds;"oom: warn if we go OOM for higher order and compaction is disabled Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least. Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine. We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense. Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing). This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled. This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason. Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9254990fb9f0f15f25605748da20cfbeced7c816;"Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing)";no;no;yes 108;MDY6Q29tbWl0MjMyNTI5ODo5MjU0OTkwZmI5ZjBmMTVmMjU2MDU3NDhkYTIwY2ZiZWNlZDdjODE2;Michal Hocko;Linus Torvalds;"oom: warn if we go OOM for higher order and compaction is disabled Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least. Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine. We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense. Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing). This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled. This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason. Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9254990fb9f0f15f25605748da20cfbeced7c816;" This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled";yes;no;no 108;MDY6Q29tbWl0MjMyNTI5ODo5MjU0OTkwZmI5ZjBmMTVmMjU2MDU3NDhkYTIwY2ZiZWNlZDdjODE2;Michal Hocko;Linus Torvalds;"oom: warn if we go OOM for higher order and compaction is disabled Since the lumpy reclaim is gone there is no source of higher order pages if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is unreliable for that purpose to say the least. Hitting an OOM for !costly higher order requests is therefore all not that hard to imagine. We are trying hard to not invoke OOM killer as much as possible but there is simply no reliable way to detect whether more reclaim retries make sense. Disabling COMPACTION is not widespread but it seems that some users might have disable the feature without realizing full consequences (mostly along with disabling THP because compaction used to be THP mainly thing). This patch just adds a note if the OOM killer was triggered by higher order request with compaction disabled. This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason. Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9254990fb9f0f15f25605748da20cfbeced7c816;" This will help us identifying possible misconfiguration right from the oom report which is easier than to always keep in mind that somebody might have disabled COMPACTION without a good reason.";no;yes;no 109;MDY6Q29tbWl0MjMyNTI5ODoxYjUxZTY1ZWFiNjRmYWM3MmNhYjAwOTY5MWU4Y2E5OTE1NjI0ODc2;Michal Hocko;Linus Torvalds;"oom, oom_reaper: allow to reap mm shared by the kthreads oom reaper was skipped for an mm which is shared with the kernel thread (aka use_mm()). The primary concern was that such a kthread might want to read from the userspace memory and see zero page as a result of the oom reaper action. This is no longer a problem after ""mm: make sure that kthreads will not refault oom reaped memory"" because any attempt to fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the target user should see an error. This means that we can finally allow oom reaper also to tasks which share their mm with kthreads. Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b51e65eab64fac72cab009691e8ca9915624876;oom, oom_reaper: allow to reap mm shared by the kthreads;yes;no;no 109;MDY6Q29tbWl0MjMyNTI5ODoxYjUxZTY1ZWFiNjRmYWM3MmNhYjAwOTY5MWU4Y2E5OTE1NjI0ODc2;Michal Hocko;Linus Torvalds;"oom, oom_reaper: allow to reap mm shared by the kthreads oom reaper was skipped for an mm which is shared with the kernel thread (aka use_mm()). The primary concern was that such a kthread might want to read from the userspace memory and see zero page as a result of the oom reaper action. This is no longer a problem after ""mm: make sure that kthreads will not refault oom reaped memory"" because any attempt to fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the target user should see an error. This means that we can finally allow oom reaper also to tasks which share their mm with kthreads. Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b51e65eab64fac72cab009691e8ca9915624876;"oom reaper was skipped for an mm which is shared with the kernel thread (aka use_mm())";no;no;yes 109;MDY6Q29tbWl0MjMyNTI5ODoxYjUxZTY1ZWFiNjRmYWM3MmNhYjAwOTY5MWU4Y2E5OTE1NjI0ODc2;Michal Hocko;Linus Torvalds;"oom, oom_reaper: allow to reap mm shared by the kthreads oom reaper was skipped for an mm which is shared with the kernel thread (aka use_mm()). The primary concern was that such a kthread might want to read from the userspace memory and see zero page as a result of the oom reaper action. This is no longer a problem after ""mm: make sure that kthreads will not refault oom reaped memory"" because any attempt to fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the target user should see an error. This means that we can finally allow oom reaper also to tasks which share their mm with kthreads. Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b51e65eab64fac72cab009691e8ca9915624876;" The primary concern was that such a kthread might want to read from the userspace memory and see zero page as a result of the oom reaper action";no;yes;yes 109;MDY6Q29tbWl0MjMyNTI5ODoxYjUxZTY1ZWFiNjRmYWM3MmNhYjAwOTY5MWU4Y2E5OTE1NjI0ODc2;Michal Hocko;Linus Torvalds;"oom, oom_reaper: allow to reap mm shared by the kthreads oom reaper was skipped for an mm which is shared with the kernel thread (aka use_mm()). The primary concern was that such a kthread might want to read from the userspace memory and see zero page as a result of the oom reaper action. This is no longer a problem after ""mm: make sure that kthreads will not refault oom reaped memory"" because any attempt to fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the target user should see an error. This means that we can finally allow oom reaper also to tasks which share their mm with kthreads. Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b51e65eab64fac72cab009691e8ca9915624876;" This is no longer a problem after ""mm: make sure that kthreads will not refault oom reaped memory"" because any attempt to fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the target user should see an error";no;yes;yes 109;MDY6Q29tbWl0MjMyNTI5ODoxYjUxZTY1ZWFiNjRmYWM3MmNhYjAwOTY5MWU4Y2E5OTE1NjI0ODc2;Michal Hocko;Linus Torvalds;"oom, oom_reaper: allow to reap mm shared by the kthreads oom reaper was skipped for an mm which is shared with the kernel thread (aka use_mm()). The primary concern was that such a kthread might want to read from the userspace memory and see zero page as a result of the oom reaper action. This is no longer a problem after ""mm: make sure that kthreads will not refault oom reaped memory"" because any attempt to fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the target user should see an error. This means that we can finally allow oom reaper also to tasks which share their mm with kthreads. Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b51e65eab64fac72cab009691e8ca9915624876;" This means that we can finally allow oom reaper also to tasks which share their mm with kthreads.";yes;yes;no 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;mm: make sure that kthreads will not refault oom reaped memory;yes;yes;no 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;There are only few use_mm() users in the kernel right now;no;no;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;" Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context";no;no;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;" This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context";no;yes;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;To quote Michael S;no;no;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;Tsirkin;no;no;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;: Getting an error from __get_user and friends is handled gracefully;no;no;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;": Getting zero instead of a real value will cause userspace : memory corruption";no;yes;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;"The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general";no;no;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;" The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context";no;yes;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;"Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag";yes;yes;no 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;" __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled";yes;no;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;" If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code";yes;yes;no 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;"Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them";yes;yes;yes 110;MDY6Q29tbWl0MjMyNTI5ODozZjcwZGMzOGNlYzJhZDZlNTM1NWY4MGM0YzdhMTVhM2Y3ZTk3YTE5;Michal Hocko;Linus Torvalds;"mm: make sure that kthreads will not refault oom reaped memory There are only few use_mm() users in the kernel right now. Most of them write to the target memory but vhost driver relies on copy_from_user/get_user from a kernel thread context. This makes it impossible to reap the memory of an oom victim which shares the mm with the vhost kernel thread because it could see a zero page unexpectedly and theoretically make an incorrect decision visible outside of the killed task context. To quote Michael S. Tsirkin: : Getting an error from __get_user and friends is handled gracefully. : Getting zero instead of a real value will cause userspace : memory corruption. The vhost kernel thread is bound to an open fd of the vhost device which is not tight to the mm owner life cycle in general. The device fd can be inherited or passed over to another process which means that we really have to be careful about unexpected memory corruption because unlike for normal oom victims the result will be visible outside of the oom victim context. Make sure that no kthread context (users of use_mm) can ever see corrupted data because of the oom reaper and hook into the page fault path by checking MMF_UNSTABLE mm flag. __oom_reap_task_mm will set the flag before it starts unmapping the address space while the flag is checked after the page fault has been handled. If the flag is set then SIGBUS is triggered so any g-u-p user will get a error code. Regular tasks do not need this protection because all which share the mm are killed when the mm is reaped and so the corruption will not outlive them. This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet. Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: ""Michael S. Tsirkin"" <mst@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f70dc38cec2ad6e5355f80c4c7a15a3f7e97a19;"This patch shouldn't have any visible effect at this moment because the OOM killer doesn't invoke oom reaper for tasks with mm shared with kthreads yet.";yes;no;yes 111;MDY6Q29tbWl0MjMyNTI5ODozODUzMTIwMWMxMjE0NGNkN2Q5NmFiZmRmZTc0NDljMmIwMTM3NWU4;Tetsuo Handa;Linus Torvalds;"mm, oom: enforce exit_oom_victim on current task There are no users of exit_oom_victim on !current task anymore so enforce the API to always work on the current. Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/38531201c12144cd7d96abfdfe7449c2b01375e8;mm, oom: enforce exit_oom_victim on current task;yes;no;no 111;MDY6Q29tbWl0MjMyNTI5ODozODUzMTIwMWMxMjE0NGNkN2Q5NmFiZmRmZTc0NDljMmIwMTM3NWU4;Tetsuo Handa;Linus Torvalds;"mm, oom: enforce exit_oom_victim on current task There are no users of exit_oom_victim on !current task anymore so enforce the API to always work on the current. Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/38531201c12144cd7d96abfdfe7449c2b01375e8;"There are no users of exit_oom_victim on !current task anymore so enforce the API to always work on the current.";yes;yes;yes 113;MDY6Q29tbWl0MjMyNTI5ODo4NjJlMzA3M2IzZWVkMTNmMTdiZDZiZTZjYTYwNTJkYjE1YzBiNzI4;Michal Hocko;Linus Torvalds;"mm, oom: get rid of signal_struct::oom_victims After ""oom: keep mm of the killed task available"" we can safely detect an oom victim by checking task->signal->oom_mm so we do not need the signal_struct counter anymore so let's get rid of it. This alone wouldn't be sufficient for nommu archs because exit_oom_victim doesn't hide the process from the oom killer anymore. We can, however, mark the mm with a MMF flag in __mmput. We can reuse MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP. Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/862e3073b3eed13f17bd6be6ca6052db15c0b728;mm, oom: get rid of signal_struct::oom_victims;yes;no;no 113;MDY6Q29tbWl0MjMyNTI5ODo4NjJlMzA3M2IzZWVkMTNmMTdiZDZiZTZjYTYwNTJkYjE1YzBiNzI4;Michal Hocko;Linus Torvalds;"mm, oom: get rid of signal_struct::oom_victims After ""oom: keep mm of the killed task available"" we can safely detect an oom victim by checking task->signal->oom_mm so we do not need the signal_struct counter anymore so let's get rid of it. This alone wouldn't be sufficient for nommu archs because exit_oom_victim doesn't hide the process from the oom killer anymore. We can, however, mark the mm with a MMF flag in __mmput. We can reuse MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP. Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/862e3073b3eed13f17bd6be6ca6052db15c0b728;"After ""oom: keep mm of the killed task available"" we can safely detect an oom victim by checking task->signal->oom_mm so we do not need the signal_struct counter anymore so let's get rid of it";yes;yes;yes 113;MDY6Q29tbWl0MjMyNTI5ODo4NjJlMzA3M2IzZWVkMTNmMTdiZDZiZTZjYTYwNTJkYjE1YzBiNzI4;Michal Hocko;Linus Torvalds;"mm, oom: get rid of signal_struct::oom_victims After ""oom: keep mm of the killed task available"" we can safely detect an oom victim by checking task->signal->oom_mm so we do not need the signal_struct counter anymore so let's get rid of it. This alone wouldn't be sufficient for nommu archs because exit_oom_victim doesn't hide the process from the oom killer anymore. We can, however, mark the mm with a MMF flag in __mmput. We can reuse MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP. Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/862e3073b3eed13f17bd6be6ca6052db15c0b728;"This alone wouldn't be sufficient for nommu archs because exit_oom_victim doesn't hide the process from the oom killer anymore";yes;yes;yes 113;MDY6Q29tbWl0MjMyNTI5ODo4NjJlMzA3M2IzZWVkMTNmMTdiZDZiZTZjYTYwNTJkYjE1YzBiNzI4;Michal Hocko;Linus Torvalds;"mm, oom: get rid of signal_struct::oom_victims After ""oom: keep mm of the killed task available"" we can safely detect an oom victim by checking task->signal->oom_mm so we do not need the signal_struct counter anymore so let's get rid of it. This alone wouldn't be sufficient for nommu archs because exit_oom_victim doesn't hide the process from the oom killer anymore. We can, however, mark the mm with a MMF flag in __mmput. We can reuse MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP. Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/862e3073b3eed13f17bd6be6ca6052db15c0b728;We can, however, mark the mm with a MMF flag in __mmput;yes;no;no 113;MDY6Q29tbWl0MjMyNTI5ODo4NjJlMzA3M2IzZWVkMTNmMTdiZDZiZTZjYTYwNTJkYjE1YzBiNzI4;Michal Hocko;Linus Torvalds;"mm, oom: get rid of signal_struct::oom_victims After ""oom: keep mm of the killed task available"" we can safely detect an oom victim by checking task->signal->oom_mm so we do not need the signal_struct counter anymore so let's get rid of it. This alone wouldn't be sufficient for nommu archs because exit_oom_victim doesn't hide the process from the oom killer anymore. We can, however, mark the mm with a MMF flag in __mmput. We can reuse MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP. Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/862e3073b3eed13f17bd6be6ca6052db15c0b728;" We can reuse MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.";yes;yes;no 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;oom: keep mm of the killed task available;yes;no;no 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;"oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever";no;no;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs";no;yes;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" exit_oom_victim should be only called from the victim's context ideally";no;yes;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;One way to achieve this would be to rely on per mm_struct flags;no;no;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init""";no;no;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;"The problem is that the exit path";no;yes;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" do_exit exit_mm mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time";no;yes;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there";no;yes;yes 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;This patch takes a different approach;yes;no;no 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims";yes;no;no 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct";yes;yes;no 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm";yes;yes;no 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;"Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway";yes;yes;no 114;MDY6Q29tbWl0MjMyNTI5ODoyNmRiNjJmMTc5ZDExMmQzNDUwMzFlMTQ5MjZhNGNkYTljZDQwZDZl;Michal Hocko;Linus Torvalds;"oom: keep mm of the killed task available oom_reap_task has to call exit_oom_victim in order to make sure that the oom vicim will not block the oom killer for ever. This is, however, opening new problems (e.g oom_killer_disable exclusion - see commit 74070542099c (""oom, suspend: fix oom_reaper vs. oom_killer_disable race"")). exit_oom_victim should be only called from the victim's context ideally. One way to achieve this would be to rely on per mm_struct flags. We already have MMF_OOM_REAPED to hide a task from the oom killer since ""mm, oom: hide mm which is shared with kthread or global init"". The problem is that the exit path: do_exit exit_mm tsk->mm = NULL; mmput __mmput exit_oom_victim doesn't guarantee that exit_oom_victim will get called in a bounded amount of time. At least exit_aio depends on IO which might get blocked due to lack of memory and who knows what else is lurking there. This patch takes a different approach. We remember tsk->mm into the signal_struct and bind it to the signal struct life time for all oom victims. __oom_reap_task_mm as well as oom_scan_process_thread do not have to rely on find_lock_task_mm anymore and they will have a reliable reference to the mm struct. As a result all the oom specific communication inside the OOM killer can be done via tsk->signal->oom_mm. Increasing the signal_struct for something as unlikely as the oom killer is far from ideal but this approach will make the code much more reasonable and long term we even might want to move task->mm into the signal_struct anyway. In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice. Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26db62f179d112d345031e14926a4cda9cd40d6e;" In the next step we might want to make the oom killer exclusion and access to memory reserves completely independent which would be also nice.";no;yes;no 115;MDY6Q29tbWl0MjMyNTI5ODo4NDk2YWZhYmE5M2VjZTgwYTgzY2JkMDk2ZjA2NzVhMTAyMGRkZmM0;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: do not attempt to reap a task twice ""mm, oom_reaper: do not attempt to reap a task twice"" tried to give the OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But the usefulness of the flag is rather limited and actually never shown in practice. If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write() due to presumably being blocked at unkillable wait waiting for other thread's memory allocation. But since one of threads sharing that mm will queue that mm immediately via task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again due to oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful. Let's always set MMF_OOM_REAPED. Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8496afaba93ece80a83cbd096f0675a1020ddfc4;mm,oom_reaper: do not attempt to reap a task twice;yes;yes;no 115;MDY6Q29tbWl0MjMyNTI5ODo4NDk2YWZhYmE5M2VjZTgwYTgzY2JkMDk2ZjA2NzVhMTAyMGRkZmM0;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: do not attempt to reap a task twice ""mm, oom_reaper: do not attempt to reap a task twice"" tried to give the OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But the usefulness of the flag is rather limited and actually never shown in practice. If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write() due to presumably being blocked at unkillable wait waiting for other thread's memory allocation. But since one of threads sharing that mm will queue that mm immediately via task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again due to oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful. Let's always set MMF_OOM_REAPED. Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8496afaba93ece80a83cbd096f0675a1020ddfc4;"""mm, oom_reaper: do not attempt to reap a task twice"" tried to give the OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag";no;no;yes 115;MDY6Q29tbWl0MjMyNTI5ODo4NDk2YWZhYmE5M2VjZTgwYTgzY2JkMDk2ZjA2NzVhMTAyMGRkZmM0;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: do not attempt to reap a task twice ""mm, oom_reaper: do not attempt to reap a task twice"" tried to give the OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But the usefulness of the flag is rather limited and actually never shown in practice. If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write() due to presumably being blocked at unkillable wait waiting for other thread's memory allocation. But since one of threads sharing that mm will queue that mm immediately via task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again due to oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful. Let's always set MMF_OOM_REAPED. Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8496afaba93ece80a83cbd096f0675a1020ddfc4;"But the usefulness of the flag is rather limited and actually never shown in practice";no;yes;yes 115;MDY6Q29tbWl0MjMyNTI5ODo4NDk2YWZhYmE5M2VjZTgwYTgzY2JkMDk2ZjA2NzVhMTAyMGRkZmM0;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: do not attempt to reap a task twice ""mm, oom_reaper: do not attempt to reap a task twice"" tried to give the OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But the usefulness of the flag is rather limited and actually never shown in practice. If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write() due to presumably being blocked at unkillable wait waiting for other thread's memory allocation. But since one of threads sharing that mm will queue that mm immediately via task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again due to oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful. Let's always set MMF_OOM_REAPED. Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8496afaba93ece80a83cbd096f0675a1020ddfc4;" If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write() due to presumably being blocked at unkillable wait waiting for other thread's memory allocation";no;yes;yes 115;MDY6Q29tbWl0MjMyNTI5ODo4NDk2YWZhYmE5M2VjZTgwYTgzY2JkMDk2ZjA2NzVhMTAyMGRkZmM0;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: do not attempt to reap a task twice ""mm, oom_reaper: do not attempt to reap a task twice"" tried to give the OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But the usefulness of the flag is rather limited and actually never shown in practice. If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write() due to presumably being blocked at unkillable wait waiting for other thread's memory allocation. But since one of threads sharing that mm will queue that mm immediately via task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again due to oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful. Let's always set MMF_OOM_REAPED. Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8496afaba93ece80a83cbd096f0675a1020ddfc4;" But since one of threads sharing that mm will queue that mm immediately via task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again due to oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful";no;yes;yes 115;MDY6Q29tbWl0MjMyNTI5ODo4NDk2YWZhYmE5M2VjZTgwYTgzY2JkMDk2ZjA2NzVhMTAyMGRkZmM0;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: do not attempt to reap a task twice ""mm, oom_reaper: do not attempt to reap a task twice"" tried to give the OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag. But the usefulness of the flag is rather limited and actually never shown in practice. If the flag is set, it means that the holder of mm->mmap_sem cannot call up_write() due to presumably being blocked at unkillable wait waiting for other thread's memory allocation. But since one of threads sharing that mm will queue that mm immediately via task_will_free_mem() shortcut (otherwise, oom_badness() will select the same mm again due to oom_score_adj value unchanged), retrying MMF_OOM_NOT_REAPABLE mm is unlikely helpful. Let's always set MMF_OOM_REAPED. Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8496afaba93ece80a83cbd096f0675a1020ddfc4;Let's always set MMF_OOM_REAPED.;yes;no;no 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;mm,oom_reaper: reduce find_lock_task_mm() usage;yes;yes;no 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;"Patch series ""fortify oom killer even more"", v2";no;yes;no 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;This patch (of 9);no;no;no 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;"__oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed";yes;yes;yes 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;"We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow";yes;yes;no 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;" Moreover, this will make later patch in the series easier to review";no;yes;no 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;" Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory";yes;yes;no 116;MDY6Q29tbWl0MjMyNTI5ODo3ZWJmZmE0NTU1MWZlN2RiODZhMmIzMmJmNTg2ZjEyNGVmNDg0ZTZl;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: reduce find_lock_task_mm() usage Patch series ""fortify oom killer even more"", v2. This patch (of 9): __oom_reap_task() can be simplified a bit if it receives a valid mm from oom_reap_task() which also uses that mm when __oom_reap_task() failed. We can drop one find_lock_task_mm() call and also make the __oom_reap_task() code flow easier to follow. Moreover, this will make later patch in the series easier to review. Pinning mm's mm_count for longer time is not really harmful because this will not pin much memory. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ebffa45551fe7db86a2b32bf586f124ef484e6e;This patch doesn't introduce any functional change.;yes;no;no 117;MDY6Q29tbWl0MjMyNTI5ODo1ODcwYzJlMWQ3OGIwNDNiNjlkZTMxOTk0NjljMDU2Y2EzYjA1MTAy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: fix task_will_free_mem() comment Attempt to demystify the task_will_free_mem() loop. Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5870c2e1d78b043b69de3199469c056ca3b05102;mm/oom_kill.c: fix task_will_free_mem() comment;yes;yes;no 117;MDY6Q29tbWl0MjMyNTI5ODo1ODcwYzJlMWQ3OGIwNDNiNjlkZTMxOTk0NjljMDU2Y2EzYjA1MTAy;Michal Hocko;Linus Torvalds;"mm/oom_kill.c: fix task_will_free_mem() comment Attempt to demystify the task_will_free_mem() loop. Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5870c2e1d78b043b69de3199469c056ca3b05102;Attempt to demystify the task_will_free_mem() loop.;yes;yes;no 118;MDY6Q29tbWl0MjMyNTI5ODo3YzVmNjRmODQ0ODNiZDEzODg2MzQ4ZWRkYThiM2U3Yjc5OWE3ZmRi;Vladimir Davydov;Linus Torvalds;"mm: oom: deduplicate victim selection code for memcg and global oom When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c5f64f84483bd13886348edda8b3e7b799a7fdb;mm: oom: deduplicate victim selection code for memcg and global oom;yes;yes;no 118;MDY6Q29tbWl0MjMyNTI5ODo3YzVmNjRmODQ0ODNiZDEzODg2MzQ4ZWRkYThiM2U3Yjc5OWE3ZmRi;Vladimir Davydov;Linus Torvalds;"mm: oom: deduplicate victim selection code for memcg and global oom When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c5f64f84483bd13886348edda8b3e7b799a7fdb;"When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom";no;no;yes 118;MDY6Q29tbWl0MjMyNTI5ODo3YzVmNjRmODQ0ODNiZDEzODg2MzQ4ZWRkYThiM2U3Yjc5OWE3ZmRi;Vladimir Davydov;Linus Torvalds;"mm: oom: deduplicate victim selection code for memcg and global oom When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c5f64f84483bd13886348edda8b3e7b799a7fdb;" The only difference is the scope of tasks to select the victim from";no;no;yes 118;MDY6Q29tbWl0MjMyNTI5ODo3YzVmNjRmODQ0ODNiZDEzODg2MzQ4ZWRkYThiM2U3Yjc5OWE3ZmRi;Vladimir Davydov;Linus Torvalds;"mm: oom: deduplicate victim selection code for memcg and global oom When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c5f64f84483bd13886348edda8b3e7b799a7fdb;" So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it";no;yes;yes 118;MDY6Q29tbWl0MjMyNTI5ODo3YzVmNjRmODQ0ODNiZDEzODg2MzQ4ZWRkYThiM2U3Yjc5OWE3ZmRi;Vladimir Davydov;Linus Torvalds;"mm: oom: deduplicate victim selection code for memcg and global oom When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c5f64f84483bd13886348edda8b3e7b799a7fdb;" That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory";no;yes;yes 118;MDY6Q29tbWl0MjMyNTI5ODo3YzVmNjRmODQ0ODNiZDEzODg2MzQ4ZWRkYThiM2U3Yjc5OWE3ZmRi;Vladimir Davydov;Linus Torvalds;"mm: oom: deduplicate victim selection code for memcg and global oom When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c5f64f84483bd13886348edda8b3e7b799a7fdb;"Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks).";yes;yes;no 119;MDY6Q29tbWl0MjMyNTI5ODpmMzNlNmYwNjcxYjNiYTgxYWNlZjRkN2MwNzhhZjg2YWZjYzg1NWM0;Geert Uytterhoeven;Linus Torvalds;"mm, oom: fix uninitialized ret in task_will_free_mem() mm/oom_kill.c: In function `task_will_free_mem': mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function If __task_will_free_mem() is never called inside the for_each_process() loop, ret will not be initialized. Fixes: 1af8bb43269563e4 (""mm, oom: fortify task_will_free_mem()"") Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f33e6f0671b3ba81acef4d7c078af86afcc855c4;mm, oom: fix uninitialized ret in task_will_free_mem();yes;yes;no 119;MDY6Q29tbWl0MjMyNTI5ODpmMzNlNmYwNjcxYjNiYTgxYWNlZjRkN2MwNzhhZjg2YWZjYzg1NWM0;Geert Uytterhoeven;Linus Torvalds;"mm, oom: fix uninitialized ret in task_will_free_mem() mm/oom_kill.c: In function `task_will_free_mem': mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function If __task_will_free_mem() is never called inside the for_each_process() loop, ret will not be initialized. Fixes: 1af8bb43269563e4 (""mm, oom: fortify task_will_free_mem()"") Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f33e6f0671b3ba81acef4d7c078af86afcc855c4; mm/oom_kill.c: In function `task_will_free_mem';no;no;no 119;MDY6Q29tbWl0MjMyNTI5ODpmMzNlNmYwNjcxYjNiYTgxYWNlZjRkN2MwNzhhZjg2YWZjYzg1NWM0;Geert Uytterhoeven;Linus Torvalds;"mm, oom: fix uninitialized ret in task_will_free_mem() mm/oom_kill.c: In function `task_will_free_mem': mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function If __task_will_free_mem() is never called inside the for_each_process() loop, ret will not be initialized. Fixes: 1af8bb43269563e4 (""mm, oom: fortify task_will_free_mem()"") Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f33e6f0671b3ba81acef4d7c078af86afcc855c4;" mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function If __task_will_free_mem() is never called inside the for_each_process() loop, ret will not be initialized.";no;yes;yes 121;MDY6Q29tbWl0MjMyNTI5ODphMzczOTY2ZDFmNjRjMDRiYTlkMDE1OTA4N2YwZmExYjVhYWM0YzMz;Michal Hocko;Linus Torvalds;"mm, oom: hide mm which is shared with kthread or global init The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a373966d1f64c04ba9d0159087f0fa1b5aac4c33;mm, oom: hide mm which is shared with kthread or global init;yes;no;no 121;MDY6Q29tbWl0MjMyNTI5ODphMzczOTY2ZDFmNjRjMDRiYTlkMDE1OTA4N2YwZmExYjVhYWM0YzMz;Michal Hocko;Linus Torvalds;"mm, oom: hide mm which is shared with kthread or global init The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a373966d1f64c04ba9d0159087f0fa1b5aac4c33;"The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init";no;no;yes 121;MDY6Q29tbWl0MjMyNTI5ODphMzczOTY2ZDFmNjRjMDRiYTlkMDE1OTA4N2YwZmExYjVhYWM0YzMz;Michal Hocko;Linus Torvalds;"mm, oom: hide mm which is shared with kthread or global init The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a373966d1f64c04ba9d0159087f0fa1b5aac4c33;" After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND)";no;no;yes 121;MDY6Q29tbWl0MjMyNTI5ODphMzczOTY2ZDFmNjRjMDRiYTlkMDE1OTA4N2YwZmExYjVhYWM0YzMz;Michal Hocko;Linus Torvalds;"mm, oom: hide mm which is shared with kthread or global init The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a373966d1f64c04ba9d0159087f0fa1b5aac4c33;" use_mm() users are quite rare as well";no;no;yes 121;MDY6Q29tbWl0MjMyNTI5ODphMzczOTY2ZDFmNjRjMDRiYTlkMDE1OTA4N2YwZmExYjVhYWM0YzMz;Michal Hocko;Linus Torvalds;"mm, oom: hide mm which is shared with kthread or global init The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a373966d1f64c04ba9d0159087f0fa1b5aac4c33;"In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it";yes;yes;no 121;MDY6Q29tbWl0MjMyNTI5ODphMzczOTY2ZDFmNjRjMDRiYTlkMDE1OTA4N2YwZmExYjVhYWM0YzMz;Michal Hocko;Linus Torvalds;"mm, oom: hide mm which is shared with kthread or global init The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a373966d1f64c04ba9d0159087f0fa1b5aac4c33;"oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims";yes;no;yes 121;MDY6Q29tbWl0MjMyNTI5ODphMzczOTY2ZDFmNjRjMDRiYTlkMDE1OTA4N2YwZmExYjVhYWM0YzMz;Michal Hocko;Linus Torvalds;"mm, oom: hide mm which is shared with kthread or global init The only case where the oom_reaper is not triggered for the oom victim is when it shares the memory with a kernel thread (aka use_mm) or with the global init. After ""mm, oom: skip vforked tasks from being selected"" the victim cannot be a vforked task of the global init so we are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users are quite rare as well. In order to help forward progress for the OOM killer, make sure that this really rare case will not get in the way - we do this by hiding the mm from the oom killer by setting MMF_OOM_REAPED flag for it. oom_scan_process_thread will ignore any TIF_MEMDIE task if it has MMF_OOM_REAPED flag set to catch these oom victims. After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive. Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a373966d1f64c04ba9d0159087f0fa1b5aac4c33;"After this patch we should guarantee forward progress for the OOM killer even when the selected victim is sharing memory with a kernel thread or global init as long as the victims mm is still alive.";yes;yes;no 122;MDY6Q29tbWl0MjMyNTI5ODoxMWE0MTBkNTE2ZTg5MzIwZmUwODE3NjA2ZWVhYjU4ZjM2YzIyOTY4;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11a410d516e89320fe0817606eeab58f36c22968;mm, oom_reaper: do not attempt to reap a task more than twice;yes;yes;no 122;MDY6Q29tbWl0MjMyNTI5ODoxMWE0MTBkNTE2ZTg5MzIwZmUwODE3NjA2ZWVhYjU4ZjM2YzIyOTY4;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11a410d516e89320fe0817606eeab58f36c22968;oom_reaper relies on the mmap_sem for read to do its job;no;no;yes 122;MDY6Q29tbWl0MjMyNTI5ODoxMWE0MTBkNTE2ZTg5MzIwZmUwODE3NjA2ZWVhYjU4ZjM2YzIyOTY4;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11a410d516e89320fe0817606eeab58f36c22968;" Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot";no;no;yes 122;MDY6Q29tbWl0MjMyNTI5ODoxMWE0MTBkNTE2ZTg5MzIwZmUwODE3NjA2ZWVhYjU4ZjM2YzIyOTY4;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11a410d516e89320fe0817606eeab58f36c22968;" Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed";no;yes;yes 122;MDY6Q29tbWl0MjMyNTI5ODoxMWE0MTBkNTE2ZTg5MzIwZmUwODE3NjA2ZWVhYjU4ZjM2YzIyOTY4;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11a410d516e89320fe0817606eeab58f36c22968;"This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails";yes;no;no 122;MDY6Q29tbWl0MjMyNTI5ODoxMWE0MTBkNTE2ZTg5MzIwZmUwODE3NjA2ZWVhYjU4ZjM2YzIyOTY4;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11a410d516e89320fe0817606eeab58f36c22968;" If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim";yes;yes;no 122;MDY6Q29tbWl0MjMyNTI5ODoxMWE0MTBkNTE2ZTg5MzIwZmUwODE3NjA2ZWVhYjU4ZjM2YzIyOTY4;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not attempt to reap a task more than twice oom_reaper relies on the mmap_sem for read to do its job. Many places which might block readers have been converted to use down_write_killable and that has reduced chances of the contention a lot. Some paths where the mmap_sem is held for write can take other locks and they might either be not prepared to fail due to fatal signal pending or too impractical to be changed. This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the first attempt to reap a task's mm fails. If the flag is present after the failure then we set MMF_OOM_REAPED to hide this mm from the oom killer completely so it can go and chose another victim. As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior. Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11a410d516e89320fe0817606eeab58f36c22968;"As a result a risk of OOM deadlock when the oom victim would be blocked indefinetly and so the oom killer cannot make any progress should be mitigated considerably while we still try really hard to perform all reclaim attempts and stay predictable in the behavior.";yes;yes;no 123;MDY6Q29tbWl0MjMyNTI5ODo2OTY0NTNlNjY2MzBhZDQ1ZTY0NGM0NTcxMzA3ZmEzZWJlYzlhODM1;Michal Hocko;Linus Torvalds;"mm, oom: task_will_free_mem should skip oom_reaped tasks The 0-day robot has encountered the following: Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again. This is possible only when the oom killer is bypassed because of task_will_free_mem because we skip over tasks with MMF_OOM_REAPED already set during select_bad_process. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they will be unlikely to free anything more. Analyzed by Tetsuo Handa. Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/696453e66630ad45e644c4571307fa3ebec9a835;mm, oom: task_will_free_mem should skip oom_reaped tasks;yes;no;no 123;MDY6Q29tbWl0MjMyNTI5ODo2OTY0NTNlNjY2MzBhZDQ1ZTY0NGM0NTcxMzA3ZmEzZWJlYzlhODM1;Michal Hocko;Linus Torvalds;"mm, oom: task_will_free_mem should skip oom_reaped tasks The 0-day robot has encountered the following: Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again. This is possible only when the oom killer is bypassed because of task_will_free_mem because we skip over tasks with MMF_OOM_REAPED already set during select_bad_process. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they will be unlikely to free anything more. Analyzed by Tetsuo Handa. Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/696453e66630ad45e644c4571307fa3ebec9a835;The 0-day robot has encountered the following;no;yes;yes 123;MDY6Q29tbWl0MjMyNTI5ODo2OTY0NTNlNjY2MzBhZDQ1ZTY0NGM0NTcxMzA3ZmEzZWJlYzlhODM1;Michal Hocko;Linus Torvalds;"mm, oom: task_will_free_mem should skip oom_reaped tasks The 0-day robot has encountered the following: Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again. This is possible only when the oom killer is bypassed because of task_will_free_mem because we skip over tasks with MMF_OOM_REAPED already set during select_bad_process. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they will be unlikely to free anything more. Analyzed by Tetsuo Handa. Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/696453e66630ad45e644c4571307fa3ebec9a835;" Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again";no;yes;yes 123;MDY6Q29tbWl0MjMyNTI5ODo2OTY0NTNlNjY2MzBhZDQ1ZTY0NGM0NTcxMzA3ZmEzZWJlYzlhODM1;Michal Hocko;Linus Torvalds;"mm, oom: task_will_free_mem should skip oom_reaped tasks The 0-day robot has encountered the following: Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again. This is possible only when the oom killer is bypassed because of task_will_free_mem because we skip over tasks with MMF_OOM_REAPED already set during select_bad_process. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they will be unlikely to free anything more. Analyzed by Tetsuo Handa. Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/696453e66630ad45e644c4571307fa3ebec9a835;"This is possible only when the oom killer is bypassed because of task_will_free_mem because we skip over tasks with MMF_OOM_REAPED already set during select_bad_process";no;no;yes 123;MDY6Q29tbWl0MjMyNTI5ODo2OTY0NTNlNjY2MzBhZDQ1ZTY0NGM0NTcxMzA3ZmEzZWJlYzlhODM1;Michal Hocko;Linus Torvalds;"mm, oom: task_will_free_mem should skip oom_reaped tasks The 0-day robot has encountered the following: Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again. This is possible only when the oom killer is bypassed because of task_will_free_mem because we skip over tasks with MMF_OOM_REAPED already set during select_bad_process. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they will be unlikely to free anything more. Analyzed by Tetsuo Handa. Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/696453e66630ad45e644c4571307fa3ebec9a835;" Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they will be unlikely to free anything more";yes;yes;no 123;MDY6Q29tbWl0MjMyNTI5ODo2OTY0NTNlNjY2MzBhZDQ1ZTY0NGM0NTcxMzA3ZmEzZWJlYzlhODM1;Michal Hocko;Linus Torvalds;"mm, oom: task_will_free_mem should skip oom_reaped tasks The 0-day robot has encountered the following: Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB oom_reaper is trying to reap the same task again and again. This is possible only when the oom killer is bypassed because of task_will_free_mem because we skip over tasks with MMF_OOM_REAPED already set during select_bad_process. Teach task_will_free_mem to skip over MMF_OOM_REAPED tasks as well because they will be unlikely to free anything more. Analyzed by Tetsuo Handa. Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/696453e66630ad45e644c4571307fa3ebec9a835;Analyzed by Tetsuo Handa.;no;no;no 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;mm, oom: fortify task_will_free_mem();yes;yes;no 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;task_will_free_mem is rather weak;no;yes;yes 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;" It doesn't really tell whether the task has chance to drop its mm";no;yes;yes 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;" 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm";no;no;yes 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;"This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm()";yes;yes;yes 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;Make sure that all processes sharing the mm are killed or exiting;yes;no;no 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;" This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now";yes;yes;no 124;MDY6Q29tbWl0MjMyNTI5ODoxYWY4YmI0MzI2OTU2M2U0NThlYmNmMGVjZTgxMmU5YTk3MDg2NGIz;Michal Hocko;Linus Torvalds;"mm, oom: fortify task_will_free_mem() task_will_free_mem is rather weak. It doesn't really tell whether the task has chance to drop its mm. 98748bd72200 (""oom: consider multi-threaded tasks in task_will_free_mem"") made a first step into making it more robust for multi-threaded applications so now we know that the whole process is going down and probably drop the mm. This patch builds on top for more complex scenarios where mm is shared between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel use_mm(). Make sure that all processes sharing the mm are killed or exiting. This will allow us to replace try_oom_reaper by wake_oom_reaper because task_will_free_mem implies the task is reapable now. Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer. Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1af8bb43269563e458ebcf0ece812e9a970864b3;" Therefore all paths which bypass the oom killer are now reapable and so they shouldn't lock up the oom killer.";yes;yes;no 125;MDY6Q29tbWl0MjMyNTI5ODo5N2ZkNDljMjM1NWZmZGVkZTY1MjZhZmMwYzcyYmMzMTRkMDVmNDJh;Michal Hocko;Linus Torvalds;"mm, oom: kill all tasks sharing the mm Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN. After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97fd49c2355ffdede6526afc0c72bc314d05f42a;mm, oom: kill all tasks sharing the mm;yes;no;no 125;MDY6Q29tbWl0MjMyNTI5ODo5N2ZkNDljMjM1NWZmZGVkZTY1MjZhZmMwYzcyYmMzMTRkMDVmNDJh;Michal Hocko;Linus Torvalds;"mm, oom: kill all tasks sharing the mm Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN. After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97fd49c2355ffdede6526afc0c72bc314d05f42a;"Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN";no;no;yes 125;MDY6Q29tbWl0MjMyNTI5ODo5N2ZkNDljMjM1NWZmZGVkZTY1MjZhZmMwYzcyYmMzMTRkMDVmNDJh;Michal Hocko;Linus Torvalds;"mm, oom: kill all tasks sharing the mm Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN. After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97fd49c2355ffdede6526afc0c72bc314d05f42a;" After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out)";no;no;yes 125;MDY6Q29tbWl0MjMyNTI5ODo5N2ZkNDljMjM1NWZmZGVkZTY1MjZhZmMwYzcyYmMzMTRkMDVmNDJh;Michal Hocko;Linus Torvalds;"mm, oom: kill all tasks sharing the mm Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN. After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97fd49c2355ffdede6526afc0c72bc314d05f42a;"We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec";yes;yes;yes 125;MDY6Q29tbWl0MjMyNTI5ODo5N2ZkNDljMjM1NWZmZGVkZTY1MjZhZmMwYzcyYmMzMTRkMDVmNDJh;Michal Hocko;Linus Torvalds;"mm, oom: kill all tasks sharing the mm Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN. After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97fd49c2355ffdede6526afc0c72bc314d05f42a;" Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec";yes;yes;yes 125;MDY6Q29tbWl0MjMyNTI5ODo5N2ZkNDljMjM1NWZmZGVkZTY1MjZhZmMwYzcyYmMzMTRkMDVmNDJh;Michal Hocko;Linus Torvalds;"mm, oom: kill all tasks sharing the mm Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN. After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97fd49c2355ffdede6526afc0c72bc314d05f42a;" An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit";yes;yes;yes 125;MDY6Q29tbWl0MjMyNTI5ODo5N2ZkNDljMjM1NWZmZGVkZTY1MjZhZmMwYzcyYmMzMTRkMDVmNDJh;Michal Hocko;Linus Torvalds;"mm, oom: kill all tasks sharing the mm Currently oom_kill_process skips both the oom reaper and SIG_KILL if a process sharing the same mm is unkillable via OOM_ADJUST_MIN. After ""mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj"" all such processes are sharing the same value so we shouldn't see such a task at all (oom_badness would rule them out). We can still encounter oom disabled vforked task which has to be killed as well if we want to have other tasks sharing the mm reapable because it can access the memory before doing exec. Killing such a task should be acceptable because it is highly unlikely it has done anything useful because it cannot modify any memory before it calls exec. An alternative would be to keep the task alive and skip the oom reaper and risk all the weird corner cases where the OOM killer cannot make forward progress because the oom victim hung somewhere on the way to exit. [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths] Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97fd49c2355ffdede6526afc0c72bc314d05f42a;"[rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task the setting is inherently racy and we cannot do much about it without introducing locks in hot paths]";yes;yes;yes 126;MDY6Q29tbWl0MjMyNTI5ODpiMThkYzVmMjkxYzA3ZGRhZjMxNTYyYjlmMjdiM2ExMjJmMWY5Yjdl;Michal Hocko;Linus Torvalds;"mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e;mm, oom: skip vforked tasks from being selected;yes;no;no 126;MDY6Q29tbWl0MjMyNTI5ODpiMThkYzVmMjkxYzA3ZGRhZjMxNTYyYjlmMjdiM2ExMjJmMWY5Yjdl;Michal Hocko;Linus Torvalds;"mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e;vforked tasks are not really sitting on any memory;no;no;yes 126;MDY6Q29tbWl0MjMyNTI5ODpiMThkYzVmMjkxYzA3ZGRhZjMxNTYyYjlmMjdiM2ExMjJmMWY5Yjdl;Michal Hocko;Linus Torvalds;"mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e;" They are sharing the mm with parent until they exec into a new code";no;no;yes 126;MDY6Q29tbWl0MjMyNTI5ODpiMThkYzVmMjkxYzA3ZGRhZjMxNTYyYjlmMjdiM2ExMjJmMWY5Yjdl;Michal Hocko;Linus Torvalds;"mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e;" Until then it is just pinning the address space";no;yes;yes 126;MDY6Q29tbWl0MjMyNTI5ODpiMThkYzVmMjkxYzA3ZGRhZjMxNTYyYjlmMjdiM2ExMjJmMWY5Yjdl;Michal Hocko;Linus Torvalds;"mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e;" OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected";no;yes;yes 126;MDY6Q29tbWl0MjMyNTI5ODpiMThkYzVmMjkxYzA3ZGRhZjMxNTYyYjlmMjdiM2ExMjJmMWY5Yjdl;Michal Hocko;Linus Torvalds;"mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e;" init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable";no;no;yes 126;MDY6Q29tbWl0MjMyNTI5ODpiMThkYzVmMjkxYzA3ZGRhZjMxNTYyYjlmMjdiM2ExMjJmMWY5Yjdl;Michal Hocko;Linus Torvalds;"mm, oom: skip vforked tasks from being selected vforked tasks are not really sitting on any memory. They are sharing the mm with parent until they exec into a new code. Until then it is just pinning the address space. OOM killer will kill the vforked task along with its parent but we still can end up selecting vforked task when the parent wouldn't be selected. E.g. init doing vfork to launch a task or vforked being a child of oom unkillable task with an updated oom_score_adj to be killable. Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks. Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b18dc5f291c07ddaf31562b9f27b3a122f1f9b7e;"Add a new helper to check whether a task is in the vfork sharing memory with its parent and use it in oom_badness to skip over these tasks.";yes;no;no 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj;yes;no;no 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;"oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer";no;yes;yes 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;" In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible";no;yes;yes 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;" OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory";no;yes;yes 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;"It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers";no;yes;yes 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;" But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though";no;yes;yes 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;" oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork";yes;no;no 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;" As a result all the processes will share the same oom_score_adj";yes;yes;no 127;MDY6Q29tbWl0MjMyNTI5ODo0NGE3MGFkZWM5MTBkNjkyOTY4OWU0MmI2ZTVjZWU1YjdkMjAyZDIw;Michal Hocko;Linus Torvalds;"mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj oom_score_adj is shared for the thread groups (via struct signal) but this is not sufficient to cover processes sharing mm (CLONE_VM without CLONE_SIGHAND) and so we can easily end up in a situation when some processes update their oom_score_adj and confuse the oom killer. In the worst case some of those processes might hide from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer would then pick up those eligible but won't be allowed to kill others sharing the same mm so the mm wouldn't release the mm and so the memory. It would be ideal to have the oom_score_adj per mm_struct because that is the natural entity OOM killer considers. But this will not work because some programs are doing vfork() set_oom_adj() exec() We can achieve the same though. oom_score_adj write handler can set the oom_score_adj for all processes sharing the same mm if the task is not in the middle of vfork. As a result all the processes will share the same oom_score_adj. The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet. Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/44a70adec910d6929689e42b6e5cee5b7d202d20;" The current implementation is rather pessimistic and checks all the existing processes by default if there is more than 1 holder of the mm but we do not have any reliable way to check for external users yet.";yes;no;no 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;mm, oom_reaper: make sure that mmput_async is called only when memory was reaped;yes;yes;no 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;"Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race";no;yes;yes 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;"__oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future";no;yes;yes 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;" Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable";no;no;yes 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;"We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory";yes;yes;no 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;" That would reduce the impact of the delayed __mmput because the real content would be already freed";no;yes;no 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;"Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem";yes;no;no 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;" If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space";yes;no;no 128;MDY6Q29tbWl0MjMyNTI5ODplNWUzZjRjNGYwZTk1ZWNiYWQyZjhkMmY0ZjZhMjliYjhhOTAyMjZi;Michal Hocko;Linus Torvalds;"mm, oom_reaper: make sure that mmput_async is called only when memory was reaped Tetsuo is worried that mmput_async might still lead to a premature new oom victim selection due to the following race: __oom_reap_task exit_mm find_lock_task_mm atomic_inc(mm->mm_users) # = 2 task_unlock task_lock task->mm = NULL up_read(&mm->mmap_sem) < somebody write locks mmap_sem > task_unlock mmput atomic_dec_and_test # = 1 exit_oom_victim down_read_trylock # failed - no reclaim mmput_async # Takes unpredictable amount of time < new OOM situation > the final __mmput will be executed in the delayed context which might happen far in the future. Such a race is highly unlikely because the write holder of mmap_sem would have to be an external task (all direct holders are already killed or exiting) and it usually have to pin mm_users in order to do anything reasonable. We can, however, make sure that the mmput_async is only called when we do not back off and reap some memory. That would reduce the impact of the delayed __mmput because the real content would be already freed. Pin mm_count to keep it alive after we drop task_lock and before we try to get mmap_sem. If the mmap_sem succeeds we can try to grab mm_users reference and then go on with unmapping the address space. It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task. Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e5e3f4c4f0e95ecbad2f8d2f4f6a29bb8a90226b;"It is not clear whether this race is possible at all but it is better to be more robust and do not pin mm_users unless we are sure we are actually doing some real work during __oom_reap_task.";yes;yes;no 129;MDY6Q29tbWl0MjMyNTI5ODpmYmU4NGEwOWRhNzQ2Zjc4MTU1MzA1MWJiM2RiYzYzZjdiMGE1MTYy;Tetsuo Handa;Linus Torvalds;"mm,oom: remove unused argument from oom_scan_process_thread(). oom_scan_process_thread() does not use totalpages argument. oom_badness() uses it. Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fbe84a09da746f781553051bb3dbc63f7b0a5162;mm,oom: remove unused argument from oom_scan_process_thread().;yes;yes;no 129;MDY6Q29tbWl0MjMyNTI5ODpmYmU4NGEwOWRhNzQ2Zjc4MTU1MzA1MWJiM2RiYzYzZjdiMGE1MTYy;Tetsuo Handa;Linus Torvalds;"mm,oom: remove unused argument from oom_scan_process_thread(). oom_scan_process_thread() does not use totalpages argument. oom_badness() uses it. Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fbe84a09da746f781553051bb3dbc63f7b0a5162;oom_scan_process_thread() does not use totalpages argument;no;no;yes 129;MDY6Q29tbWl0MjMyNTI5ODpmYmU4NGEwOWRhNzQ2Zjc4MTU1MzA1MWJiM2RiYzYzZjdiMGE1MTYy;Tetsuo Handa;Linus Torvalds;"mm,oom: remove unused argument from oom_scan_process_thread(). oom_scan_process_thread() does not use totalpages argument. oom_badness() uses it. Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fbe84a09da746f781553051bb3dbc63f7b0a5162;oom_badness() uses it.;no;no;yes 131;MDY6Q29tbWl0MjMyNTI5ODo3OThmZDc1Njk1MmM0YjZjYjdkZmU2ZjY0MzdlOWYwMmRhNzlhNWJj;Vladimir Davydov;Linus Torvalds;"mm: zap ZONE_OOM_LOCKED Not used since oom_lock was instroduced. Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/798fd756952c4b6cb7dfe6f6437e9f02da79a5bc;mm: zap ZONE_OOM_LOCKED;yes;yes;no 131;MDY6Q29tbWl0MjMyNTI5ODo3OThmZDc1Njk1MmM0YjZjYjdkZmU2ZjY0MzdlOWYwMmRhNzlhNWJj;Vladimir Davydov;Linus Torvalds;"mm: zap ZONE_OOM_LOCKED Not used since oom_lock was instroduced. Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/798fd756952c4b6cb7dfe6f6437e9f02da79a5bc;Not used since oom_lock was instroduced.;no;yes;yes 132;MDY6Q29tbWl0MjMyNTI5ODo5ZGYxMGZiN2I4MGJjMmY1NDA5NTZiYTAxYjVlN2VlMTAxMjAwMWE1;Tetsuo Handa;Linus Torvalds;"oom_reaper: avoid pointless atomic_inc_not_zero usage. Since commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") changed to use find_lock_task_mm() for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0 because find_lock_task_mm() returns a task_struct with ->mm != NULL. Therefore, we can safely use atomic_inc(). Link: http://lkml.kernel.org/r/1465024759-8074-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9df10fb7b80bc2f540956ba01b5e7ee1012001a5;oom_reaper: avoid pointless atomic_inc_not_zero usage.;yes;yes;no 132;MDY6Q29tbWl0MjMyNTI5ODo5ZGYxMGZiN2I4MGJjMmY1NDA5NTZiYTAxYjVlN2VlMTAxMjAwMWE1;Tetsuo Handa;Linus Torvalds;"oom_reaper: avoid pointless atomic_inc_not_zero usage. Since commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") changed to use find_lock_task_mm() for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0 because find_lock_task_mm() returns a task_struct with ->mm != NULL. Therefore, we can safely use atomic_inc(). Link: http://lkml.kernel.org/r/1465024759-8074-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9df10fb7b80bc2f540956ba01b5e7ee1012001a5;"Since commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") changed to use find_lock_task_mm() for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0 because find_lock_task_mm() returns a task_struct with ->mm != NULL";no;no;yes 132;MDY6Q29tbWl0MjMyNTI5ODo5ZGYxMGZiN2I4MGJjMmY1NDA5NTZiYTAxYjVlN2VlMTAxMjAwMWE1;Tetsuo Handa;Linus Torvalds;"oom_reaper: avoid pointless atomic_inc_not_zero usage. Since commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") changed to use find_lock_task_mm() for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0 because find_lock_task_mm() returns a task_struct with ->mm != NULL. Therefore, we can safely use atomic_inc(). Link: http://lkml.kernel.org/r/1465024759-8074-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9df10fb7b80bc2f540956ba01b5e7ee1012001a5;Therefore, we can safely use atomic_inc().;yes;yes;no 133;MDY6Q29tbWl0MjMyNTI5ODo0OTFhMWM2NWFlNDk4ZGVhMGUzOWIyNGE0NmU1MjhhNzhhODUzMmVk;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: don't call mmput_async() without atomic_inc_not_zero() Commit e2fe14564d33 (""oom_reaper: close race with exiting task"") reduced frequency of needlessly selecting next OOM victim, but was calling mmput_async() when atomic_inc_not_zero() failed. Link: http://lkml.kernel.org/r/1464423365-5555-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/491a1c65ae498dea0e39b24a46e528a78a8532ed;mm,oom_reaper: don't call mmput_async() without atomic_inc_not_zero();yes;no;no 133;MDY6Q29tbWl0MjMyNTI5ODo0OTFhMWM2NWFlNDk4ZGVhMGUzOWIyNGE0NmU1MjhhNzhhODUzMmVk;Tetsuo Handa;Linus Torvalds;"mm,oom_reaper: don't call mmput_async() without atomic_inc_not_zero() Commit e2fe14564d33 (""oom_reaper: close race with exiting task"") reduced frequency of needlessly selecting next OOM victim, but was calling mmput_async() when atomic_inc_not_zero() failed. Link: http://lkml.kernel.org/r/1464423365-5555-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/491a1c65ae498dea0e39b24a46e528a78a8532ed;"Commit e2fe14564d33 (""oom_reaper: close race with exiting task"") reduced frequency of needlessly selecting next OOM victim, but was calling mmput_async() when atomic_inc_not_zero() failed.";no;yes;yes 134;MDY6Q29tbWl0MjMyNTI5ODpjYmRjZjdmNzg5MDA2MjVkZTM1MTczOTYxYjliOTVjZGUyMmJjZTQ1;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not use siglock in try_oom_reaper() Oleg has noted that siglock usage in try_oom_reaper is both pointless and dangerous. signal_group_exit can be checked lockless. The problem is that sighand becomes NULL in __exit_signal so we can crash. Fixes: 3ef22dfff239 (""oom, oom_reaper: try to reap tasks which skip regular OOM killer path"") Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cbdcf7f78900625de35173961b9b95cde22bce45;mm, oom_reaper: do not use siglock in try_oom_reaper();yes;no;no 134;MDY6Q29tbWl0MjMyNTI5ODpjYmRjZjdmNzg5MDA2MjVkZTM1MTczOTYxYjliOTVjZGUyMmJjZTQ1;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not use siglock in try_oom_reaper() Oleg has noted that siglock usage in try_oom_reaper is both pointless and dangerous. signal_group_exit can be checked lockless. The problem is that sighand becomes NULL in __exit_signal so we can crash. Fixes: 3ef22dfff239 (""oom, oom_reaper: try to reap tasks which skip regular OOM killer path"") Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cbdcf7f78900625de35173961b9b95cde22bce45;"Oleg has noted that siglock usage in try_oom_reaper is both pointless and dangerous";no;yes;yes 134;MDY6Q29tbWl0MjMyNTI5ODpjYmRjZjdmNzg5MDA2MjVkZTM1MTczOTYxYjliOTVjZGUyMmJjZTQ1;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not use siglock in try_oom_reaper() Oleg has noted that siglock usage in try_oom_reaper is both pointless and dangerous. signal_group_exit can be checked lockless. The problem is that sighand becomes NULL in __exit_signal so we can crash. Fixes: 3ef22dfff239 (""oom, oom_reaper: try to reap tasks which skip regular OOM killer path"") Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cbdcf7f78900625de35173961b9b95cde22bce45; signal_group_exit can be checked lockless;no;no;yes 134;MDY6Q29tbWl0MjMyNTI5ODpjYmRjZjdmNzg5MDA2MjVkZTM1MTczOTYxYjliOTVjZGUyMmJjZTQ1;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not use siglock in try_oom_reaper() Oleg has noted that siglock usage in try_oom_reaper is both pointless and dangerous. signal_group_exit can be checked lockless. The problem is that sighand becomes NULL in __exit_signal so we can crash. Fixes: 3ef22dfff239 (""oom, oom_reaper: try to reap tasks which skip regular OOM killer path"") Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cbdcf7f78900625de35173961b9b95cde22bce45;" The problem is that sighand becomes NULL in __exit_signal so we can crash";no;yes;yes 134;MDY6Q29tbWl0MjMyNTI5ODpjYmRjZjdmNzg5MDA2MjVkZTM1MTczOTYxYjliOTVjZGUyMmJjZTQ1;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not use siglock in try_oom_reaper() Oleg has noted that siglock usage in try_oom_reaper is both pointless and dangerous. signal_group_exit can be checked lockless. The problem is that sighand becomes NULL in __exit_signal so we can crash. Fixes: 3ef22dfff239 (""oom, oom_reaper: try to reap tasks which skip regular OOM killer path"") Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cbdcf7f78900625de35173961b9b95cde22bce45;"Fixes: 3ef22dfff239 (""oom, oom_reaper: try to reap tasks which skip regular OOM killer path"")";no;yes;no 135;MDY6Q29tbWl0MjMyNTI5ODplMmZlMTQ1NjRkMzMxNmQxNjI1ZWQyMGJmMTA4Mzk5NWY0OTYwODkz;Michal Hocko;Linus Torvalds;"oom_reaper: close race with exiting task Tetsuo has reported: Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0 sh cpuset=/ mems_allowed=0 CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Call Trace: dump_stack+0x85/0xc8 dump_header+0x5b/0x394 oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB In other words: __oom_reap_task exit_mm atomic_inc_not_zero tsk->mm = NULL mmput atomic_dec_and_test # > 0 exit_oom_victim # New victim will be # selected <OOM killer invoked> # no TIF_MEMDIE task so we can select a new one unmap_page_range # to release the memory The race exists even without the oom_reaper because anybody who pins the address space and gets preempted might race with exit_mm but oom_reaper made this race more probable. We can address the oom_reaper part by using oom_lock for __oom_reap_task because this would guarantee that a new oom victim will not be selected if the oom reaper might race with the exit path. This doesn't solve the original issue, though, because somebody else still might be pinning mm_users and so __mmput won't be called to release the memory but that is not really realiably solvable because the task will get away from the oom sight as soon as it is unhashed from the task_list and so we cannot guarantee a new victim won't be selected. [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen] [akpm@linux-foundation.org: coding-style fixes] Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e2fe14564d3316d1625ed20bf1083995f4960893;oom_reaper: close race with exiting task;yes;yes;no 135;MDY6Q29tbWl0MjMyNTI5ODplMmZlMTQ1NjRkMzMxNmQxNjI1ZWQyMGJmMTA4Mzk5NWY0OTYwODkz;Michal Hocko;Linus Torvalds;"oom_reaper: close race with exiting task Tetsuo has reported: Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0 sh cpuset=/ mems_allowed=0 CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Call Trace: dump_stack+0x85/0xc8 dump_header+0x5b/0x394 oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB In other words: __oom_reap_task exit_mm atomic_inc_not_zero tsk->mm = NULL mmput atomic_dec_and_test # > 0 exit_oom_victim # New victim will be # selected <OOM killer invoked> # no TIF_MEMDIE task so we can select a new one unmap_page_range # to release the memory The race exists even without the oom_reaper because anybody who pins the address space and gets preempted might race with exit_mm but oom_reaper made this race more probable. We can address the oom_reaper part by using oom_lock for __oom_reap_task because this would guarantee that a new oom victim will not be selected if the oom reaper might race with the exit path. This doesn't solve the original issue, though, because somebody else still might be pinning mm_users and so __mmput won't be called to release the memory but that is not really realiably solvable because the task will get away from the oom sight as soon as it is unhashed from the task_list and so we cannot guarantee a new victim won't be selected. [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen] [akpm@linux-foundation.org: coding-style fixes] Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e2fe14564d3316d1625ed20bf1083995f4960893;Tetsuo has reported;no;no;yes 135;MDY6Q29tbWl0MjMyNTI5ODplMmZlMTQ1NjRkMzMxNmQxNjI1ZWQyMGJmMTA4Mzk5NWY0OTYwODkz;Michal Hocko;Linus Torvalds;"oom_reaper: close race with exiting task Tetsuo has reported: Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0 sh cpuset=/ mems_allowed=0 CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Call Trace: dump_stack+0x85/0xc8 dump_header+0x5b/0x394 oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB In other words: __oom_reap_task exit_mm atomic_inc_not_zero tsk->mm = NULL mmput atomic_dec_and_test # > 0 exit_oom_victim # New victim will be # selected <OOM killer invoked> # no TIF_MEMDIE task so we can select a new one unmap_page_range # to release the memory The race exists even without the oom_reaper because anybody who pins the address space and gets preempted might race with exit_mm but oom_reaper made this race more probable. We can address the oom_reaper part by using oom_lock for __oom_reap_task because this would guarantee that a new oom victim will not be selected if the oom reaper might race with the exit path. This doesn't solve the original issue, though, because somebody else still might be pinning mm_users and so __mmput won't be called to release the memory but that is not really realiably solvable because the task will get away from the oom sight as soon as it is unhashed from the task_list and so we cannot guarantee a new victim won't be selected. [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen] [akpm@linux-foundation.org: coding-style fixes] Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e2fe14564d3316d1625ed20bf1083995f4960893;" __oom_reap_task exit_mm atomic_inc_not_zero tsk->mm = NULL mmput atomic_dec_and_test # > 0 exit_oom_victim # New victim will be # selected <OOM killer invoked> # no TIF_MEMDIE task so we can select a new one unmap_page_range # to release the memory The race exists even without the oom_reaper because anybody who pins the address space and gets preempted might race with exit_mm but oom_reaper made this race more probable";no;yes;yes 135;MDY6Q29tbWl0MjMyNTI5ODplMmZlMTQ1NjRkMzMxNmQxNjI1ZWQyMGJmMTA4Mzk5NWY0OTYwODkz;Michal Hocko;Linus Torvalds;"oom_reaper: close race with exiting task Tetsuo has reported: Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0 sh cpuset=/ mems_allowed=0 CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Call Trace: dump_stack+0x85/0xc8 dump_header+0x5b/0x394 oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB In other words: __oom_reap_task exit_mm atomic_inc_not_zero tsk->mm = NULL mmput atomic_dec_and_test # > 0 exit_oom_victim # New victim will be # selected <OOM killer invoked> # no TIF_MEMDIE task so we can select a new one unmap_page_range # to release the memory The race exists even without the oom_reaper because anybody who pins the address space and gets preempted might race with exit_mm but oom_reaper made this race more probable. We can address the oom_reaper part by using oom_lock for __oom_reap_task because this would guarantee that a new oom victim will not be selected if the oom reaper might race with the exit path. This doesn't solve the original issue, though, because somebody else still might be pinning mm_users and so __mmput won't be called to release the memory but that is not really realiably solvable because the task will get away from the oom sight as soon as it is unhashed from the task_list and so we cannot guarantee a new victim won't be selected. [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen] [akpm@linux-foundation.org: coding-style fixes] Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e2fe14564d3316d1625ed20bf1083995f4960893;"We can address the oom_reaper part by using oom_lock for __oom_reap_task because this would guarantee that a new oom victim will not be selected if the oom reaper might race with the exit path";yes;yes;yes 135;MDY6Q29tbWl0MjMyNTI5ODplMmZlMTQ1NjRkMzMxNmQxNjI1ZWQyMGJmMTA4Mzk5NWY0OTYwODkz;Michal Hocko;Linus Torvalds;"oom_reaper: close race with exiting task Tetsuo has reported: Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0 sh cpuset=/ mems_allowed=0 CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Call Trace: dump_stack+0x85/0xc8 dump_header+0x5b/0x394 oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB In other words: __oom_reap_task exit_mm atomic_inc_not_zero tsk->mm = NULL mmput atomic_dec_and_test # > 0 exit_oom_victim # New victim will be # selected <OOM killer invoked> # no TIF_MEMDIE task so we can select a new one unmap_page_range # to release the memory The race exists even without the oom_reaper because anybody who pins the address space and gets preempted might race with exit_mm but oom_reaper made this race more probable. We can address the oom_reaper part by using oom_lock for __oom_reap_task because this would guarantee that a new oom victim will not be selected if the oom reaper might race with the exit path. This doesn't solve the original issue, though, because somebody else still might be pinning mm_users and so __mmput won't be called to release the memory but that is not really realiably solvable because the task will get away from the oom sight as soon as it is unhashed from the task_list and so we cannot guarantee a new victim won't be selected. [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen] [akpm@linux-foundation.org: coding-style fixes] Fixes: aac453635549 (""mm, oom: introduce oom reaper"") Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e2fe14564d3316d1625ed20bf1083995f4960893;" This doesn't solve the original issue, though, because somebody else still might be pinning mm_users and so __mmput won't be called to release the memory but that is not really realiably solvable because the task will get away from the oom sight as soon as it is unhashed from the task_list and so we cannot guarantee a new victim won't be selected.";yes;yes;yes 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;mm,oom: speed up select_bad_process() loop;yes;yes;no 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;"Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread()";no;no;yes 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;"Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread";no;yes;yes 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;" Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread";no;yes;yes 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;" Therefore, we only need to do TIF_MEMDIE test on each thread";no;no;yes 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;"Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads";no;no;yes 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;" Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context";no;yes;yes 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;"If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread";yes;yes;no 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;" This will allow select_bad_process() to use for_each_process()";yes;yes;no 137;MDY6Q29tbWl0MjMyNTI5ODpmNDQ2NjZiMDQ2MDVkMWM3ZmQ5NGFiOTBiN2NjZjYzM2U3ZWZmMjI4;Tetsuo Handa;Linus Torvalds;"mm,oom: speed up select_bad_process() loop Since commit 3a5dda7a17cf (""oom: prevent unnecessary oom kills or kernel panics""), select_bad_process() is using for_each_process_thread(). Since oom_unkillable_task() scans all threads in the caller's thread group and oom_task_origin() scans signal_struct of the caller's thread group, we don't need to call oom_unkillable_task() and oom_task_origin() on each thread. Also, since !mm test will be done later at oom_badness(), we don't need to do !mm test on each thread. Therefore, we only need to do TIF_MEMDIE test on each thread. Although the original code was correct it was quite inefficient because each thread group was scanned num_threads times which can be a lot especially with processes with many threads. Even though the OOM is extremely cold path it is always good to be as effective as possible when we are inside rcu_read_lock() - aka unpreemptible context. If we track number of TIF_MEMDIE threads inside signal_struct, we don't need to do TIF_MEMDIE test on each thread. This will allow select_bad_process() to use for_each_process(). This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread(). [mhocko@suse.com: do not blow the signal_struct size] Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44666b04605d1c7fd94ab90b7ccf633e7eff228;"This patch adds a counter to signal_struct for tracking how many TIF_MEMDIE threads are in a given thread group, and check it at oom_scan_process_thread() so that select_bad_process() can use for_each_process() rather than for_each_process_thread().";yes;yes;no 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;mm, oom_reaper: do not mmput synchronously from the oom reaper context;yes;no;no 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;"Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g";no;no;yes 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832; exit_aio waits for an IO);no;no;yes 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;" If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim";no;yes;yes 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;" We should strive for making this context as reliable and independent on other subsystems as much as possible";yes;yes;no 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;"Introduce mmput_async which will perform the slow path from an async (WQ) context";yes;no;no 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;" This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore";yes;yes;yes 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;" The only exception is when mmap_sem trylock has failed which shouldn't happen too often";yes;yes;yes 138;MDY6Q29tbWl0MjMyNTI5ODplYzhkN2MxNGVhMTQ5MjJmZTIxOTQ1YjQ1OGE3NWUzOWYxMWRkODMy;Michal Hocko;Linus Torvalds;"mm, oom_reaper: do not mmput synchronously from the oom reaper context Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec8d7c14ea14922fe21945b458a75e39f11dd832;The issue is only theoretical but not impossible.;no;no;yes 139;MDY6Q29tbWl0MjMyNTI5ODpiYjhhNGI3ZmQxMjY2ZWY4ODhiM2E4MGFhNWYyNjYwNjJiMjI0ZWY0;Michal Hocko;Linus Torvalds;"mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb8a4b7fd1266ef888b3a80aa5f266062b224ef4;mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully;yes;yes;no 139;MDY6Q29tbWl0MjMyNTI5ODpiYjhhNGI3ZmQxMjY2ZWY4ODhiM2E4MGFhNWYyNjYwNjJiMjI0ZWY0;Michal Hocko;Linus Torvalds;"mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb8a4b7fd1266ef888b3a80aa5f266062b224ef4;"Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer";no;no;yes 139;MDY6Q29tbWl0MjMyNTI5ODpiYjhhNGI3ZmQxMjY2ZWY4ODhiM2E4MGFhNWYyNjYwNjJiMjI0ZWY0;Michal Hocko;Linus Torvalds;"mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb8a4b7fd1266ef888b3a80aa5f266062b224ef4;" This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct)";no;yes;yes 139;MDY6Q29tbWl0MjMyNTI5ODpiYjhhNGI3ZmQxMjY2ZWY4ODhiM2E4MGFhNWYyNjYwNjJiMjI0ZWY0;Michal Hocko;Linus Torvalds;"mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb8a4b7fd1266ef888b3a80aa5f266062b224ef4;" If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation";no;yes;yes 139;MDY6Q29tbWl0MjMyNTI5ODpiYjhhNGI3ZmQxMjY2ZWY4ODhiM2E4MGFhNWYyNjYwNjJiMjI0ZWY0;Michal Hocko;Linus Torvalds;"mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb8a4b7fd1266ef888b3a80aa5f266062b224ef4;"Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set";yes;no;no 139;MDY6Q29tbWl0MjMyNTI5ODpiYjhhNGI3ZmQxMjY2ZWY4ODhiM2E4MGFhNWYyNjYwNjJiMjI0ZWY0;Michal Hocko;Linus Torvalds;"mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb8a4b7fd1266ef888b3a80aa5f266062b224ef4;" Set the flag after __oom_reap_task is done with a task";yes;no;no 139;MDY6Q29tbWl0MjMyNTI5ODpiYjhhNGI3ZmQxMjY2ZWY4ODhiM2E4MGFhNWYyNjYwNjJiMjI0ZWY0;Michal Hocko;Linus Torvalds;"mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully Commit 36324a990cf5 (""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb8a4b7fd1266ef888b3a80aa5f266062b224ef4;" This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent.";yes;yes;no 141;MDY6Q29tbWl0MjMyNTI5ODozZWYyMmRmZmYyMzkwZTc1YjM3OWY5NzE1Mzg4YTg1MmFhNTZlMGQ1;Michal Hocko;Linus Torvalds;"oom, oom_reaper: try to reap tasks which skip regular OOM killer path If either the current task is already killed or PF_EXITING or a selected task is PF_EXITING then the oom killer is suppressed and so is the oom reaper. This patch adds try_oom_reaper which checks the given task and queues it for the oom reaper if that is safe to be done meaning that the task doesn't share the mm with an alive process. This might help to release the memory pressure while the task tries to exit. [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3ef22dfff2390e75b379f9715388a852aa56e0d5;oom, oom_reaper: try to reap tasks which skip regular OOM killer path;yes;yes;no 141;MDY6Q29tbWl0MjMyNTI5ODozZWYyMmRmZmYyMzkwZTc1YjM3OWY5NzE1Mzg4YTg1MmFhNTZlMGQ1;Michal Hocko;Linus Torvalds;"oom, oom_reaper: try to reap tasks which skip regular OOM killer path If either the current task is already killed or PF_EXITING or a selected task is PF_EXITING then the oom killer is suppressed and so is the oom reaper. This patch adds try_oom_reaper which checks the given task and queues it for the oom reaper if that is safe to be done meaning that the task doesn't share the mm with an alive process. This might help to release the memory pressure while the task tries to exit. [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3ef22dfff2390e75b379f9715388a852aa56e0d5;"If either the current task is already killed or PF_EXITING or a selected task is PF_EXITING then the oom killer is suppressed and so is the oom reaper";no;no;yes 141;MDY6Q29tbWl0MjMyNTI5ODozZWYyMmRmZmYyMzkwZTc1YjM3OWY5NzE1Mzg4YTg1MmFhNTZlMGQ1;Michal Hocko;Linus Torvalds;"oom, oom_reaper: try to reap tasks which skip regular OOM killer path If either the current task is already killed or PF_EXITING or a selected task is PF_EXITING then the oom killer is suppressed and so is the oom reaper. This patch adds try_oom_reaper which checks the given task and queues it for the oom reaper if that is safe to be done meaning that the task doesn't share the mm with an alive process. This might help to release the memory pressure while the task tries to exit. [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3ef22dfff2390e75b379f9715388a852aa56e0d5;" This patch adds try_oom_reaper which checks the given task and queues it for the oom reaper if that is safe to be done meaning that the task doesn't share the mm with an alive process";yes;yes;no 141;MDY6Q29tbWl0MjMyNTI5ODozZWYyMmRmZmYyMzkwZTc1YjM3OWY5NzE1Mzg4YTg1MmFhNTZlMGQ1;Michal Hocko;Linus Torvalds;"oom, oom_reaper: try to reap tasks which skip regular OOM killer path If either the current task is already killed or PF_EXITING or a selected task is PF_EXITING then the oom killer is suppressed and so is the oom reaper. This patch adds try_oom_reaper which checks the given task and queues it for the oom reaper if that is safe to be done meaning that the task doesn't share the mm with an alive process. This might help to release the memory pressure while the task tries to exit. [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3ef22dfff2390e75b379f9715388a852aa56e0d5;"This might help to release the memory pressure while the task tries to exit.";no;yes;no 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;mm, oom: move GFP_NOFS check to out_of_memory;yes;no;no 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;"__alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked";no;no;yes 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;" This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them";no;no;yes 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;"The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer";no;no;yes 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;" This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request)";yes;yes;yes 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;" Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there";yes;yes;no 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;"There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context";no;yes;yes 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;"Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about Note to the current oom_notifier users";yes;yes;no 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;"The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock";yes;yes;yes 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;" Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way";yes;yes;yes 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;" Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now";yes;yes;no 142;MDY6Q29tbWl0MjMyNTI5ODozZGE4OGZiM2JhY2ZhYTMzZmY5ZDEzNzMwZDE3MTEwYmIyZDk2MDRk;Michal Hocko;Linus Torvalds;"mm, oom: move GFP_NOFS check to out_of_memory __alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3da88fb3bacfaa33ff9d13730d17110bb2d9604d;" That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests.";yes;no;no 143;MDY6Q29tbWl0MjMyNTI5ODphZjhlMTVjYzg1YTI1MzE1NWZkY2VhNzA3NTg4YmY2ZGRmYzBiZTJl;Michal Hocko;Linus Torvalds;"oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head Commit bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") has simplified the check for tasks already enqueued for the oom reaper by checking tsk->oom_reaper_list != NULL. This check is not sufficient because the tsk might be the head of the queue without any other tasks queued and then we would simply lockup looping on the same task. Fix the condition by checking for the head as well. Fixes: bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af8e15cc85a253155fdcea707588bf6ddfc0be2e;oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head;yes;no;no 143;MDY6Q29tbWl0MjMyNTI5ODphZjhlMTVjYzg1YTI1MzE1NWZkY2VhNzA3NTg4YmY2ZGRmYzBiZTJl;Michal Hocko;Linus Torvalds;"oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head Commit bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") has simplified the check for tasks already enqueued for the oom reaper by checking tsk->oom_reaper_list != NULL. This check is not sufficient because the tsk might be the head of the queue without any other tasks queued and then we would simply lockup looping on the same task. Fix the condition by checking for the head as well. Fixes: bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af8e15cc85a253155fdcea707588bf6ddfc0be2e;"Commit bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") has simplified the check for tasks already enqueued for the oom reaper by checking tsk->oom_reaper_list != NULL";no;no;yes 143;MDY6Q29tbWl0MjMyNTI5ODphZjhlMTVjYzg1YTI1MzE1NWZkY2VhNzA3NTg4YmY2ZGRmYzBiZTJl;Michal Hocko;Linus Torvalds;"oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head Commit bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") has simplified the check for tasks already enqueued for the oom reaper by checking tsk->oom_reaper_list != NULL. This check is not sufficient because the tsk might be the head of the queue without any other tasks queued and then we would simply lockup looping on the same task. Fix the condition by checking for the head as well. Fixes: bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af8e15cc85a253155fdcea707588bf6ddfc0be2e;" This check is not sufficient because the tsk might be the head of the queue without any other tasks queued and then we would simply lockup looping on the same task";no;yes;yes 143;MDY6Q29tbWl0MjMyNTI5ODphZjhlMTVjYzg1YTI1MzE1NWZkY2VhNzA3NTg4YmY2ZGRmYzBiZTJl;Michal Hocko;Linus Torvalds;"oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head Commit bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") has simplified the check for tasks already enqueued for the oom reaper by checking tsk->oom_reaper_list != NULL. This check is not sufficient because the tsk might be the head of the queue without any other tasks queued and then we would simply lockup looping on the same task. Fix the condition by checking for the head as well. Fixes: bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af8e15cc85a253155fdcea707588bf6ddfc0be2e; Fix the condition by checking for the head as well;yes;yes;no 143;MDY6Q29tbWl0MjMyNTI5ODphZjhlMTVjYzg1YTI1MzE1NWZkY2VhNzA3NTg4YmY2ZGRmYzBiZTJl;Michal Hocko;Linus Torvalds;"oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head Commit bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") has simplified the check for tasks already enqueued for the oom reaper by checking tsk->oom_reaper_list != NULL. This check is not sufficient because the tsk might be the head of the queue without any other tasks queued and then we would simply lockup looping on the same task. Fix the condition by checking for the head as well. Fixes: bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/af8e15cc85a253155fdcea707588bf6ddfc0be2e;"Fixes: bb29902a7515 (""oom, oom_reaper: protect oom_reaper_list using simpler way"")";yes;yes;no 144;MDY6Q29tbWl0MjMyNTI5ODpiYjI5OTAyYTc1MTUyMDg4NDYxMTRiM2IzNmE0MjgxYTliYmY3NjZh;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: protect oom_reaper_list using simpler way ""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"" tried to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it by simply checking tsk->oom_reaper_list != NULL. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb29902a7515208846114b3b36a4281a9bbf766a;oom, oom_reaper: protect oom_reaper_list using simpler way;yes;yes;no 144;MDY6Q29tbWl0MjMyNTI5ODpiYjI5OTAyYTc1MTUyMDg4NDYxMTRiM2IzNmE0MjgxYTliYmY3NjZh;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: protect oom_reaper_list using simpler way ""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"" tried to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it by simply checking tsk->oom_reaper_list != NULL. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb29902a7515208846114b3b36a4281a9bbf766a;"""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"" tried to protect oom_reaper_list using MMF_OOM_KILLED flag";no;no;yes 144;MDY6Q29tbWl0MjMyNTI5ODpiYjI5OTAyYTc1MTUyMDg4NDYxMTRiM2IzNmE0MjgxYTliYmY3NjZh;Tetsuo Handa;Linus Torvalds;"oom, oom_reaper: protect oom_reaper_list using simpler way ""oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"" tried to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it by simply checking tsk->oom_reaper_list != NULL. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bb29902a7515208846114b3b36a4281a9bbf766a;" But we can do it by simply checking tsk->oom_reaper_list != NULL.";yes;yes;no 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9;oom: make oom_reaper freezable;yes;no;no 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9;"After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done";no;no;yes 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9; This might however race with the PM freezer;no;yes;yes 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9;"CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic";no;yes;yes 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9;" We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated";no;yes;yes 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9;" It might trigger IO or touch devices which are frozen already";no;yes;yes 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9;In order to close this race, make the oom_reaper thread freezable;yes;yes;no 145;MDY6Q29tbWl0MjMyNTI5ODplMjY3OTYwNjZmZGY5MjljYmJhMjJkYWJiODAxODA4Zjk4NmFjZGI5;Michal Hocko;Linus Torvalds;"oom: make oom_reaper freezable After ""oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space"" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e26796066fdf929cbba22dabb801808f986acdb9;" This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible";yes;yes;yes 146;MDY6Q29tbWl0MjMyNTI5ODoyOWM2OTZlMWM2ZWNlYjVkYjZiMjFmMGM4OTQ5NWZjZmNkNDBjMGVi;Vladimir Davydov;Linus Torvalds;"oom: make oom_reaper_list single linked Entries are only added/removed from oom_reaper_list at head so we can use a single linked list and hence save a word in task_struct. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/29c696e1c6eceb5db6b21f0c89495fcfcd40c0eb;oom: make oom_reaper_list single linked;yes;no;no 146;MDY6Q29tbWl0MjMyNTI5ODoyOWM2OTZlMWM2ZWNlYjVkYjZiMjFmMGM4OTQ5NWZjZmNkNDBjMGVi;Vladimir Davydov;Linus Torvalds;"oom: make oom_reaper_list single linked Entries are only added/removed from oom_reaper_list at head so we can use a single linked list and hence save a word in task_struct. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/29c696e1c6eceb5db6b21f0c89495fcfcd40c0eb;"Entries are only added/removed from oom_reaper_list at head so we can use a single linked list and hence save a word in task_struct.";yes;yes;yes 147;MDY6Q29tbWl0MjMyNTI5ODo4NTViMDE4MzI1NzM3Zjc2OTFmOWI3ZDg2MzM5ZGY0MGFhNGU0N2Mz;Michal Hocko;Linus Torvalds;"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g. by sacrificing the same child multiple times. This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/855b018325737f7691f9b7d86339df40aa4e47c3;oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task;yes;no;no 147;MDY6Q29tbWl0MjMyNTI5ODo4NTViMDE4MzI1NzM3Zjc2OTFmOWI3ZDg2MzM5ZGY0MGFhNGU0N2Mz;Michal Hocko;Linus Torvalds;"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g. by sacrificing the same child multiple times. This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/855b018325737f7691f9b7d86339df40aa4e47c3;"Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g";no;yes;yes 147;MDY6Q29tbWl0MjMyNTI5ODo4NTViMDE4MzI1NzM3Zjc2OTFmOWI3ZDg2MzM5ZGY0MGFhNGU0N2Mz;Michal Hocko;Linus Torvalds;"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g. by sacrificing the same child multiple times. This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/855b018325737f7691f9b7d86339df40aa4e47c3;" by sacrificing the same child multiple times";no;yes;yes 147;MDY6Q29tbWl0MjMyNTI5ODo4NTViMDE4MzI1NzM3Zjc2OTFmOWI3ZDg2MzM5ZGY0MGFhNGU0N2Mz;Michal Hocko;Linus Torvalds;"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g. by sacrificing the same child multiple times. This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/855b018325737f7691f9b7d86339df40aa4e47c3;"This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set.";yes;yes;no 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;mm, oom_reaper: implement OOM victims queuing;yes;no;no 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;wake_oom_reaper has allowed only 1 oom victim to be queued;no;no;yes 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;" The main reason for that was the simplicity as other solutions would require some way of queuing";no;no;yes 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;" The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time";no;yes;yes 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;" The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible";no;yes;yes 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed; E.g;no;no;yes 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;" the original victim might have not released a lot of memory for some reason";no;yes;yes 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;"The situation would improve considerably if wake_oom_reaper used a more robust queuing";yes;yes;yes 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed; This is what this patch implements;yes;yes;no 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;" This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization";yes;no;no 148;MDY6Q29tbWl0MjMyNTI5ODowMzA0OTI2OWRlNDMzY2I1ZmUyODU5YmU5YWU0NDY5Y2ViMTE2M2Vk;Michal Hocko;Linus Torvalds;"mm, oom_reaper: implement OOM victims queuing wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/03049269de433cb5fe2859be9ae4469ceb1163ed;" wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it.";yes;no;no 149;MDY6Q29tbWl0MjMyNTI5ODpiYzQ0OGU4OTdiNmQyNGFhZTMyNzAxNzYzYjhhMWZlMTVkMjlmYTI2;Michal Hocko;Linus Torvalds;"mm, oom_reaper: report success/failure Inform about the successful/failed oom_reaper attempts and dump all the held locks to tell us more who is blocking the progress. [akpm@linux-foundation.org: fix CONFIG_MMU=n build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bc448e897b6d24aae32701763b8a1fe15d29fa26;mm, oom_reaper: report success/failure;yes;yes;no 149;MDY6Q29tbWl0MjMyNTI5ODpiYzQ0OGU4OTdiNmQyNGFhZTMyNzAxNzYzYjhhMWZlMTVkMjlmYTI2;Michal Hocko;Linus Torvalds;"mm, oom_reaper: report success/failure Inform about the successful/failed oom_reaper attempts and dump all the held locks to tell us more who is blocking the progress. [akpm@linux-foundation.org: fix CONFIG_MMU=n build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bc448e897b6d24aae32701763b8a1fe15d29fa26;"Inform about the successful/failed oom_reaper attempts and dump all the held locks to tell us more who is blocking the progress.";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;mm, oom: introduce oom reaper;yes;no;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;This patch (of 5);yes;no;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;"This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov";yes;no;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;"The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory";no;no;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path";no;no;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;It has been shown (e.g;no;yes;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g";no;yes;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;lock) which is blocked by another task looping in the page allocator;no;yes;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;"This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete";yes;no;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;"A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855; allocating memory);yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;"oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt";yes;yes;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" basically any lock blocking forward progress as described above";yes;yes;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;The API between oom killer and oom reaper is quite trivial;yes;no;yes 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;"wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work";yes;no;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" This means that only a single mm_struct can be reaped at the time";yes;no;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm";yes;yes;no 151;MDY6Q29tbWl0MjMyNTI5ODphYWM0NTM2MzU1NDk2OTljMTNhODRlYTE0NTZkNWIwZTU3NGVmODU1;Michal Hocko;Linus Torvalds;"mm, oom: introduce oom reaper This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/aac453635549699c13a84ea1456d5b0e574ef855;" mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).";yes;yes;no 152;MDY6Q29tbWl0MjMyNTI5ODo2YWZjZjI4OTVlNmYyMjk0NzZiZDBkYzk1ZmU5MTVlNjM5YWUwYTQ5;Tetsuo Handa;Linus Torvalds;"mm,oom: make oom_killer_disable() killable While oom_killer_disable() is called by freeze_processes() after all user threads except the current thread are frozen, it is possible that kernel threads invoke the OOM killer and sends SIGKILL to the current thread due to sharing the thawed victim's memory. Therefore, checking for SIGKILL is preferable than TIF_MEMDIE. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6afcf2895e6f229476bd0dc95fe915e639ae0a49;mm,oom: make oom_killer_disable() killable;yes;no;no 152;MDY6Q29tbWl0MjMyNTI5ODo2YWZjZjI4OTVlNmYyMjk0NzZiZDBkYzk1ZmU5MTVlNjM5YWUwYTQ5;Tetsuo Handa;Linus Torvalds;"mm,oom: make oom_killer_disable() killable While oom_killer_disable() is called by freeze_processes() after all user threads except the current thread are frozen, it is possible that kernel threads invoke the OOM killer and sends SIGKILL to the current thread due to sharing the thawed victim's memory. Therefore, checking for SIGKILL is preferable than TIF_MEMDIE. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6afcf2895e6f229476bd0dc95fe915e639ae0a49;"While oom_killer_disable() is called by freeze_processes() after all user threads except the current thread are frozen, it is possible that kernel threads invoke the OOM killer and sends SIGKILL to the current thread due to sharing the thawed victim's memory";no;yes;yes 152;MDY6Q29tbWl0MjMyNTI5ODo2YWZjZjI4OTVlNmYyMjk0NzZiZDBkYzk1ZmU5MTVlNjM5YWUwYTQ5;Tetsuo Handa;Linus Torvalds;"mm,oom: make oom_killer_disable() killable While oom_killer_disable() is called by freeze_processes() after all user threads except the current thread are frozen, it is possible that kernel threads invoke the OOM killer and sends SIGKILL to the current thread due to sharing the thawed victim's memory. Therefore, checking for SIGKILL is preferable than TIF_MEMDIE. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6afcf2895e6f229476bd0dc95fe915e639ae0a49;" Therefore, checking for SIGKILL is preferable than TIF_MEMDIE.";yes;yes;yes 153;MDY6Q29tbWl0MjMyNTI5ODo3NTZhMDI1ZjAwMDkxOTE4ZDlkMDljYTMyMjlkZWZiMTYwYjQwOWMw;Joe Perches;Linus Torvalds;"mm: coalesce split strings Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/756a025f00091918d9d09ca3229defb160b409c0;mm: coalesce split strings;yes;no;no 153;MDY6Q29tbWl0MjMyNTI5ODo3NTZhMDI1ZjAwMDkxOTE4ZDlkMDljYTMyMjlkZWZiMTYwYjQwOWMw;Joe Perches;Linus Torvalds;"mm: coalesce split strings Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/756a025f00091918d9d09ca3229defb160b409c0;"Kernel style prefers a single string over split strings when the string is 'user-visible'";no;yes;yes 153;MDY6Q29tbWl0MjMyNTI5ODo3NTZhMDI1ZjAwMDkxOTE4ZDlkMDljYTMyMjlkZWZiMTYwYjQwOWMw;Joe Perches;Linus Torvalds;"mm: coalesce split strings Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/756a025f00091918d9d09ca3229defb160b409c0;Miscellanea;no;no;no 153;MDY6Q29tbWl0MjMyNTI5ODo3NTZhMDI1ZjAwMDkxOTE4ZDlkMDljYTMyMjlkZWZiMTYwYjQwOWMw;Joe Perches;Linus Torvalds;"mm: coalesce split strings Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/756a025f00091918d9d09ca3229defb160b409c0;" - Add a missing newline - Realign arguments";yes;yes;no 154;MDY6Q29tbWl0MjMyNTI5ODo2YTYxODk1N2FkMTdkOGY0ZjRjN2VlZWRlNzUyNjg1Mzc0YjFiMTc2;Johannes Weiner;Linus Torvalds;"mm: oom_kill: don't ignore oom score on exiting tasks When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a618957ad17d8f4f4c7eeede752685374b1b176;mm: oom_kill: don't ignore oom score on exiting tasks;yes;no;no 154;MDY6Q29tbWl0MjMyNTI5ODo2YTYxODk1N2FkMTdkOGY0ZjRjN2VlZWRlNzUyNjg1Mzc0YjFiMTc2;Johannes Weiner;Linus Torvalds;"mm: oom_kill: don't ignore oom score on exiting tasks When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a618957ad17d8f4f4c7eeede752685374b1b176;"When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score";no;no;yes 154;MDY6Q29tbWl0MjMyNTI5ODo2YTYxODk1N2FkMTdkOGY0ZjRjN2VlZWRlNzUyNjg1Mzc0YjFiMTc2;Johannes Weiner;Linus Torvalds;"mm: oom_kill: don't ignore oom score on exiting tasks When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a618957ad17d8f4f4c7eeede752685374b1b176;" The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim";no;yes;yes 154;MDY6Q29tbWl0MjMyNTI5ODo2YTYxODk1N2FkMTdkOGY0ZjRjN2VlZWRlNzUyNjg1Mzc0YjFiMTc2;Johannes Weiner;Linus Torvalds;"mm: oom_kill: don't ignore oom score on exiting tasks When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a618957ad17d8f4f4c7eeede752685374b1b176;"Frankly, I don't even know why we check for exiting tasks in the OOM killer";no;yes;yes 154;MDY6Q29tbWl0MjMyNTI5ODo2YTYxODk1N2FkMTdkOGY0ZjRjN2VlZWRlNzUyNjg1Mzc0YjFiMTc2;Johannes Weiner;Linus Torvalds;"mm: oom_kill: don't ignore oom score on exiting tasks When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a618957ad17d8f4f4c7eeede752685374b1b176;" We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill";no;yes;yes 154;MDY6Q29tbWl0MjMyNTI5ODo2YTYxODk1N2FkMTdkOGY0ZjRjN2VlZWRlNzUyNjg1Mzc0YjFiMTc2;Johannes Weiner;Linus Torvalds;"mm: oom_kill: don't ignore oom score on exiting tasks When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a618957ad17d8f4f4c7eeede752685374b1b176;This is testing pure noise;no;yes;yes 154;MDY6Q29tbWl0MjMyNTI5ODo2YTYxODk1N2FkMTdkOGY0ZjRjN2VlZWRlNzUyNjg1Mzc0YjFiMTc2;Johannes Weiner;Linus Torvalds;"mm: oom_kill: don't ignore oom score on exiting tasks When the OOM killer scans tasks and encounters a PF_EXITING one, it force-selects that task regardless of the score. The problem is that if that task got stuck waiting for some state the allocation site is holding, the OOM reaper can not move on to the next best victim. Frankly, I don't even know why we check for exiting tasks in the OOM killer. We've tried direct reclaim at least 15 times by the time we decide the system is OOM, there was plenty of time to exit and free memory; and a task might exit voluntarily right after we issue a kill. This is testing pure noise. Remove it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a618957ad17d8f4f4c7eeede752685374b1b176; Remove it.;yes;no;no 155;MDY6Q29tbWl0MjMyNTI5ODphMDc5NWNkNDE2ZDExNDIxMTc2OTVmOTMyYTk2OTA2MTFhZTBlZGJi;Vlastimil Babka;Linus Torvalds;"mm, oom: print symbolic gfp_flags in oom warning It would be useful to translate gfp_flags into string representation when printing in case of an OOM, especially as the flags have been undergoing some changes recently and the script ./scripts/gfp-translate needs a matching source version to be accurate. Example output: a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO), order=0, om_score_adj=0 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a0795cd416d1142117695f932a9690611ae0edbb;mm, oom: print symbolic gfp_flags in oom warning;yes;no;no 155;MDY6Q29tbWl0MjMyNTI5ODphMDc5NWNkNDE2ZDExNDIxMTc2OTVmOTMyYTk2OTA2MTFhZTBlZGJi;Vlastimil Babka;Linus Torvalds;"mm, oom: print symbolic gfp_flags in oom warning It would be useful to translate gfp_flags into string representation when printing in case of an OOM, especially as the flags have been undergoing some changes recently and the script ./scripts/gfp-translate needs a matching source version to be accurate. Example output: a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO), order=0, om_score_adj=0 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a0795cd416d1142117695f932a9690611ae0edbb;"It would be useful to translate gfp_flags into string representation when printing in case of an OOM, especially as the flags have been undergoing some changes recently and the script ./scripts/gfp-translate needs a matching source version to be accurate";no;yes;yes 155;MDY6Q29tbWl0MjMyNTI5ODphMDc5NWNkNDE2ZDExNDIxMTc2OTVmOTMyYTk2OTA2MTFhZTBlZGJi;Vlastimil Babka;Linus Torvalds;"mm, oom: print symbolic gfp_flags in oom warning It would be useful to translate gfp_flags into string representation when printing in case of an OOM, especially as the flags have been undergoing some changes recently and the script ./scripts/gfp-translate needs a matching source version to be accurate. Example output: a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO), order=0, om_score_adj=0 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a0795cd416d1142117695f932a9690611ae0edbb;Example output;no;yes;yes 156;MDY6Q29tbWl0MjMyNTI5ODplY2E1NmZmOTA2YmRkMDIzOTQ4NWU4YjQ3MTU0YTZlNzNkZDlhMmYz;Jerome Marchand;Linus Torvalds;"mm, shmem: add internal shmem resident memory accounting Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported ""shmem-rss"" from ""file-rss"". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/eca56ff906bdd0239485e8b47154a6e73dd9a2f3;mm, shmem: add internal shmem resident memory accounting;yes;no;no 156;MDY6Q29tbWl0MjMyNTI5ODplY2E1NmZmOTA2YmRkMDIzOTQ4NWU4YjQ3MTU0YTZlNzNkZDlhMmYz;Jerome Marchand;Linus Torvalds;"mm, shmem: add internal shmem resident memory accounting Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported ""shmem-rss"" from ""file-rss"". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/eca56ff906bdd0239485e8b47154a6e73dd9a2f3;"Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different";no;yes;yes 156;MDY6Q29tbWl0MjMyNTI5ODplY2E1NmZmOTA2YmRkMDIzOTQ4NWU4YjQ3MTU0YTZlNzNkZDlhMmYz;Jerome Marchand;Linus Torvalds;"mm, shmem: add internal shmem resident memory accounting Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported ""shmem-rss"" from ""file-rss"". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/eca56ff906bdd0239485e8b47154a6e73dd9a2f3;"The internal accounting currently counts shmem pages together with regular files";no;no;yes 156;MDY6Q29tbWl0MjMyNTI5ODplY2E1NmZmOTA2YmRkMDIzOTQ4NWU4YjQ3MTU0YTZlNzNkZDlhMmYz;Jerome Marchand;Linus Torvalds;"mm, shmem: add internal shmem resident memory accounting Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported ""shmem-rss"" from ""file-rss"". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/eca56ff906bdd0239485e8b47154a6e73dd9a2f3;" As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES";yes;yes;no 156;MDY6Q29tbWl0MjMyNTI5ODplY2E1NmZmOTA2YmRkMDIzOTQ4NWU4YjQ3MTU0YTZlNzNkZDlhMmYz;Jerome Marchand;Linus Torvalds;"mm, shmem: add internal shmem resident memory accounting Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported ""shmem-rss"" from ""file-rss"". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/eca56ff906bdd0239485e8b47154a6e73dd9a2f3;" The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before";yes;yes;no 156;MDY6Q29tbWl0MjMyNTI5ODplY2E1NmZmOTA2YmRkMDIzOTQ4NWU4YjQ3MTU0YTZlNzNkZDlhMmYz;Jerome Marchand;Linus Torvalds;"mm, shmem: add internal shmem resident memory accounting Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported ""shmem-rss"" from ""file-rss"". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/eca56ff906bdd0239485e8b47154a6e73dd9a2f3;" The only user-visible change after this patch is the OOM killer message that separates the reported ""shmem-rss"" from ""file-rss"".";yes;no;no 157;MDY6Q29tbWl0MjMyNTI5ODphMmI4MjlkOTU5NThkYTIwMjVlZjg0NGMwZjUzYWMxNWFkNzIwZmFj;Chen Jie;Linus Torvalds;"mm/oom_kill.c: avoid attempting to kill init sharing same memory It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice: Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory ... Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes. [rientjes@google.com: rewrote changelog] [akpm@linux-foundation.org: fix inverted test, per Ben] Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2b829d95958da2025ef844c0f53ac15ad720fac;mm/oom_kill.c: avoid attempting to kill init sharing same memory;yes;yes;no 157;MDY6Q29tbWl0MjMyNTI5ODphMmI4MjlkOTU5NThkYTIwMjVlZjg0NGMwZjUzYWMxNWFkNzIwZmFj;Chen Jie;Linus Torvalds;"mm/oom_kill.c: avoid attempting to kill init sharing same memory It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice: Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory ... Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes. [rientjes@google.com: rewrote changelog] [akpm@linux-foundation.org: fix inverted test, per Ben] Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2b829d95958da2025ef844c0f53ac15ad720fac;"It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well";no;yes;yes 157;MDY6Q29tbWl0MjMyNTI5ODphMmI4MjlkOTU5NThkYTIwMjVlZjg0NGMwZjUzYWMxNWFkNzIwZmFj;Chen Jie;Linus Torvalds;"mm/oom_kill.c: avoid attempting to kill init sharing same memory It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice: Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory ... Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes. [rientjes@google.com: rewrote changelog] [akpm@linux-foundation.org: fix inverted test, per Ben] Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2b829d95958da2025ef844c0f53ac15ad720fac;This has been shown in practice;no;yes;yes 157;MDY6Q29tbWl0MjMyNTI5ODphMmI4MjlkOTU5NThkYTIwMjVlZjg0NGMwZjUzYWMxNWFkNzIwZmFj;Chen Jie;Linus Torvalds;"mm/oom_kill.c: avoid attempting to kill init sharing same memory It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice: Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory ... Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes. [rientjes@google.com: rewrote changelog] [akpm@linux-foundation.org: fix inverted test, per Ben] Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2b829d95958da2025ef844c0f53ac15ad720fac;" Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic";no;yes;yes 157;MDY6Q29tbWl0MjMyNTI5ODphMmI4MjlkOTU5NThkYTIwMjVlZjg0NGMwZjUzYWMxNWFkNzIwZmFj;Chen Jie;Linus Torvalds;"mm/oom_kill.c: avoid attempting to kill init sharing same memory It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice: Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory ... Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes. [rientjes@google.com: rewrote changelog] [akpm@linux-foundation.org: fix inverted test, per Ben] Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2b829d95958da2025ef844c0f53ac15ad720fac;"If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state";no;no;yes 157;MDY6Q29tbWl0MjMyNTI5ODphMmI4MjlkOTU5NThkYTIwMjVlZjg0NGMwZjUzYWMxNWFkNzIwZmFj;Chen Jie;Linus Torvalds;"mm/oom_kill.c: avoid attempting to kill init sharing same memory It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice: Out of memory: Kill process 9134 (init) score 3 or sacrifice child Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB Kill process 1 (init) sharing same memory ... Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes. [rientjes@google.com: rewrote changelog] [akpm@linux-foundation.org: fix inverted test, per Ben] Signed-off-by: Chen Jie <chenjie6@huawei.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2b829d95958da2025ef844c0f53ac15ad720fac;"However, it's better not to try to kill init and allow the machine to panic due to unkillable processes.";yes;yes;yes 158;MDY6Q29tbWl0MjMyNTI5ODpkYjJhMGRkN2E0M2RlNTk1ZDNmMDU0Mjk4NmJiMTdjY2I2Y2MzNjRj;Yaowei Bai;Linus Torvalds;"mm/oom_kill.c: introduce is_sysrq_oom helper Introduce is_sysrq_oom helper function indicating oom kill triggered by sysrq to improve readability. No functional changes. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/db2a0dd7a43de595d3f0542986bb17ccb6cc364c;mm/oom_kill.c: introduce is_sysrq_oom helper;yes;no;no 158;MDY6Q29tbWl0MjMyNTI5ODpkYjJhMGRkN2E0M2RlNTk1ZDNmMDU0Mjk4NmJiMTdjY2I2Y2MzNjRj;Yaowei Bai;Linus Torvalds;"mm/oom_kill.c: introduce is_sysrq_oom helper Introduce is_sysrq_oom helper function indicating oom kill triggered by sysrq to improve readability. No functional changes. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/db2a0dd7a43de595d3f0542986bb17ccb6cc364c;"Introduce is_sysrq_oom helper function indicating oom kill triggered by sysrq to improve readability";yes;yes;no 158;MDY6Q29tbWl0MjMyNTI5ODpkYjJhMGRkN2E0M2RlNTk1ZDNmMDU0Mjk4NmJiMTdjY2I2Y2MzNjRj;Yaowei Bai;Linus Torvalds;"mm/oom_kill.c: introduce is_sysrq_oom helper Introduce is_sysrq_oom helper function indicating oom kill triggered by sysrq to improve readability. No functional changes. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/db2a0dd7a43de595d3f0542986bb17ccb6cc364c;No functional changes.;yes;no;no 159;MDY6Q29tbWl0MjMyNTI5ODo0ZDdiMzM5NGY3NmVkNzJjZmRlYzIzY2E1NTcxZGJhYjZlYzQxNzkz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() Both ""child->mm == mm"" and ""p->mm != mm"" checks in oom_kill_process() are wrong. task->mm can be NULL if the task is the exited group leader. This means in particular that ""kill sharing same memory"" loop can miss a process with a zombie leader which uses the same ->mm. Note: the process_has_mm(child, p->mm) check is still not 100% correct, p->mm can be NULL too. This is minor, but probably deserves a fix or a comment anyway. [akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d7b3394f76ed72cfdec23ca5571dbab6ec41793;mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process();yes;yes;no 159;MDY6Q29tbWl0MjMyNTI5ODo0ZDdiMzM5NGY3NmVkNzJjZmRlYzIzY2E1NTcxZGJhYjZlYzQxNzkz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() Both ""child->mm == mm"" and ""p->mm != mm"" checks in oom_kill_process() are wrong. task->mm can be NULL if the task is the exited group leader. This means in particular that ""kill sharing same memory"" loop can miss a process with a zombie leader which uses the same ->mm. Note: the process_has_mm(child, p->mm) check is still not 100% correct, p->mm can be NULL too. This is minor, but probably deserves a fix or a comment anyway. [akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d7b3394f76ed72cfdec23ca5571dbab6ec41793;"Both ""child->mm == mm"" and ""p->mm != mm"" checks in oom_kill_process() are wrong";no;yes;yes 159;MDY6Q29tbWl0MjMyNTI5ODo0ZDdiMzM5NGY3NmVkNzJjZmRlYzIzY2E1NTcxZGJhYjZlYzQxNzkz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() Both ""child->mm == mm"" and ""p->mm != mm"" checks in oom_kill_process() are wrong. task->mm can be NULL if the task is the exited group leader. This means in particular that ""kill sharing same memory"" loop can miss a process with a zombie leader which uses the same ->mm. Note: the process_has_mm(child, p->mm) check is still not 100% correct, p->mm can be NULL too. This is minor, but probably deserves a fix or a comment anyway. [akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d7b3394f76ed72cfdec23ca5571dbab6ec41793; task->mm can be NULL if the task is the exited group leader;no;no;yes 159;MDY6Q29tbWl0MjMyNTI5ODo0ZDdiMzM5NGY3NmVkNzJjZmRlYzIzY2E1NTcxZGJhYjZlYzQxNzkz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() Both ""child->mm == mm"" and ""p->mm != mm"" checks in oom_kill_process() are wrong. task->mm can be NULL if the task is the exited group leader. This means in particular that ""kill sharing same memory"" loop can miss a process with a zombie leader which uses the same ->mm. Note: the process_has_mm(child, p->mm) check is still not 100% correct, p->mm can be NULL too. This is minor, but probably deserves a fix or a comment anyway. [akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d7b3394f76ed72cfdec23ca5571dbab6ec41793;" This means in particular that ""kill sharing same memory"" loop can miss a process with a zombie leader which uses the same ->mm";no;yes;yes 159;MDY6Q29tbWl0MjMyNTI5ODo0ZDdiMzM5NGY3NmVkNzJjZmRlYzIzY2E1NTcxZGJhYjZlYzQxNzkz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() Both ""child->mm == mm"" and ""p->mm != mm"" checks in oom_kill_process() are wrong. task->mm can be NULL if the task is the exited group leader. This means in particular that ""kill sharing same memory"" loop can miss a process with a zombie leader which uses the same ->mm. Note: the process_has_mm(child, p->mm) check is still not 100% correct, p->mm can be NULL too. This is minor, but probably deserves a fix or a comment anyway. [akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d7b3394f76ed72cfdec23ca5571dbab6ec41793;"Note: the process_has_mm(child, p->mm) check is still not 100% correct, p->mm can be NULL too";no;no;yes 159;MDY6Q29tbWl0MjMyNTI5ODo0ZDdiMzM5NGY3NmVkNzJjZmRlYzIzY2E1NTcxZGJhYjZlYzQxNzkz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() Both ""child->mm == mm"" and ""p->mm != mm"" checks in oom_kill_process() are wrong. task->mm can be NULL if the task is the exited group leader. This means in particular that ""kill sharing same memory"" loop can miss a process with a zombie leader which uses the same ->mm. Note: the process_has_mm(child, p->mm) check is still not 100% correct, p->mm can be NULL too. This is minor, but probably deserves a fix or a comment anyway. [akpm@linux-foundation.org: document process_shares_mm() a bit] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d7b3394f76ed72cfdec23ca5571dbab6ec41793;" This is minor, but probably deserves a fix or a comment anyway.";yes;no;yes 161;MDY6Q29tbWl0MjMyNTI5ODowYzFiMmQ3ODNjZjM0MzI0OTBiZjFlNTMyYzc0MmZmZmVhZGMwYmYz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process() The fatal_signal_pending() was added to suppress unnecessary ""sharing same memory"" message, but it can't 100% help anyway because it can be false-negative; SIGKILL can be already dequeued. And worse, it can be false-positive due to exec or coredump. exec is mostly fine, but coredump is not. It is possible that the group leader has the pending SIGKILL because its sub-thread originated the coredump, in this case we must not skip this process. We could probably add the additional ->group_exit_task check but this patch just removes the wrong check along with pr_info(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0c1b2d783cf3432490bf1e532c742fffeadc0bf3;mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process();yes;yes;no 161;MDY6Q29tbWl0MjMyNTI5ODowYzFiMmQ3ODNjZjM0MzI0OTBiZjFlNTMyYzc0MmZmZmVhZGMwYmYz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process() The fatal_signal_pending() was added to suppress unnecessary ""sharing same memory"" message, but it can't 100% help anyway because it can be false-negative; SIGKILL can be already dequeued. And worse, it can be false-positive due to exec or coredump. exec is mostly fine, but coredump is not. It is possible that the group leader has the pending SIGKILL because its sub-thread originated the coredump, in this case we must not skip this process. We could probably add the additional ->group_exit_task check but this patch just removes the wrong check along with pr_info(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0c1b2d783cf3432490bf1e532c742fffeadc0bf3;"The fatal_signal_pending() was added to suppress unnecessary ""sharing same memory"" message, but it can't 100% help anyway because it can be false-negative; SIGKILL can be already dequeued";no;yes;yes 161;MDY6Q29tbWl0MjMyNTI5ODowYzFiMmQ3ODNjZjM0MzI0OTBiZjFlNTMyYzc0MmZmZmVhZGMwYmYz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process() The fatal_signal_pending() was added to suppress unnecessary ""sharing same memory"" message, but it can't 100% help anyway because it can be false-negative; SIGKILL can be already dequeued. And worse, it can be false-positive due to exec or coredump. exec is mostly fine, but coredump is not. It is possible that the group leader has the pending SIGKILL because its sub-thread originated the coredump, in this case we must not skip this process. We could probably add the additional ->group_exit_task check but this patch just removes the wrong check along with pr_info(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0c1b2d783cf3432490bf1e532c742fffeadc0bf3;And worse, it can be false-positive due to exec or coredump;no;yes;yes 161;MDY6Q29tbWl0MjMyNTI5ODowYzFiMmQ3ODNjZjM0MzI0OTBiZjFlNTMyYzc0MmZmZmVhZGMwYmYz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process() The fatal_signal_pending() was added to suppress unnecessary ""sharing same memory"" message, but it can't 100% help anyway because it can be false-negative; SIGKILL can be already dequeued. And worse, it can be false-positive due to exec or coredump. exec is mostly fine, but coredump is not. It is possible that the group leader has the pending SIGKILL because its sub-thread originated the coredump, in this case we must not skip this process. We could probably add the additional ->group_exit_task check but this patch just removes the wrong check along with pr_info(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0c1b2d783cf3432490bf1e532c742fffeadc0bf3;" exec is mostly fine, but coredump is not";no;yes;yes 161;MDY6Q29tbWl0MjMyNTI5ODowYzFiMmQ3ODNjZjM0MzI0OTBiZjFlNTMyYzc0MmZmZmVhZGMwYmYz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process() The fatal_signal_pending() was added to suppress unnecessary ""sharing same memory"" message, but it can't 100% help anyway because it can be false-negative; SIGKILL can be already dequeued. And worse, it can be false-positive due to exec or coredump. exec is mostly fine, but coredump is not. It is possible that the group leader has the pending SIGKILL because its sub-thread originated the coredump, in this case we must not skip this process. We could probably add the additional ->group_exit_task check but this patch just removes the wrong check along with pr_info(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0c1b2d783cf3432490bf1e532c742fffeadc0bf3;" It is possible that the group leader has the pending SIGKILL because its sub-thread originated the coredump, in this case we must not skip this process";yes;yes;yes 161;MDY6Q29tbWl0MjMyNTI5ODowYzFiMmQ3ODNjZjM0MzI0OTBiZjFlNTMyYzc0MmZmZmVhZGMwYmYz;Oleg Nesterov;Linus Torvalds;"mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process() The fatal_signal_pending() was added to suppress unnecessary ""sharing same memory"" message, but it can't 100% help anyway because it can be false-negative; SIGKILL can be already dequeued. And worse, it can be false-positive due to exec or coredump. exec is mostly fine, but coredump is not. It is possible that the group leader has the pending SIGKILL because its sub-thread originated the coredump, in this case we must not skip this process. We could probably add the additional ->group_exit_task check but this patch just removes the wrong check along with pr_info(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kyle Walker <kwalker@redhat.com> Cc: Stanislav Kozina <skozina@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0c1b2d783cf3432490bf1e532c742fffeadc0bf3;"We could probably add the additional ->group_exit_task check but this patch just removes the wrong check along with pr_info().";yes;yes;no 162;MDY6Q29tbWl0MjMyNTI5ODpkYTM5ZGEzYTU0ZmVkODhlMjkwMjRmMmYxZjZjZDczNTdjZDAzYTQ0;David Rientjes;Linus Torvalds;"mm, oom: remove task_lock protecting comm printing The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da39da3a54fed88e29024f2f1f6cd7357cd03a44;mm, oom: remove task_lock protecting comm printing;yes;no;no 162;MDY6Q29tbWl0MjMyNTI5ODpkYTM5ZGEzYTU0ZmVkODhlMjkwMjRmMmYxZjZjZDczNTdjZDAzYTQ0;David Rientjes;Linus Torvalds;"mm, oom: remove task_lock protecting comm printing The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da39da3a54fed88e29024f2f1f6cd7357cd03a44;"The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm";no;no;yes 162;MDY6Q29tbWl0MjMyNTI5ODpkYTM5ZGEzYTU0ZmVkODhlMjkwMjRmMmYxZjZjZDczNTdjZDAzYTQ0;David Rientjes;Linus Torvalds;"mm, oom: remove task_lock protecting comm printing The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da39da3a54fed88e29024f2f1f6cd7357cd03a44;"A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME";no;no;yes 162;MDY6Q29tbWl0MjMyNTI5ODpkYTM5ZGEzYTU0ZmVkODhlMjkwMjRmMmYxZjZjZDczNTdjZDAzYTQ0;David Rientjes;Linus Torvalds;"mm, oom: remove task_lock protecting comm printing The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da39da3a54fed88e29024f2f1f6cd7357cd03a44;"The comm will always be NULL-terminated, so the worst race scenario would only be during update";no;no;yes 162;MDY6Q29tbWl0MjMyNTI5ODpkYTM5ZGEzYTU0ZmVkODhlMjkwMjRmMmYxZjZjZDczNTdjZDAzYTQ0;David Rientjes;Linus Torvalds;"mm, oom: remove task_lock protecting comm printing The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da39da3a54fed88e29024f2f1f6cd7357cd03a44;" We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock";yes;yes;yes 162;MDY6Q29tbWl0MjMyNTI5ODpkYTM5ZGEzYTU0ZmVkODhlMjkwMjRmMmYxZjZjZDczNTdjZDAzYTQ0;David Rientjes;Linus Torvalds;"mm, oom: remove task_lock protecting comm printing The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da39da3a54fed88e29024f2f1f6cd7357cd03a44;"Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent.";no;yes;yes 163;MDY6Q29tbWl0MjMyNTI5ODo4NDA4MDdhOGY0MGJiMjVhOGRmNWI2NDEyYmJhNmJjMTU2NjQzYmU1;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: suppress unnecessary ""sharing same memory"" message oom_kill_process() sends SIGKILL to other thread groups sharing victim's mm. But printing ""Kill process %d (%s) sharing same memory\n"" lines makes no sense if they already have pending SIGKILL. This patch reduces the ""Kill process"" lines by printing that line with info level only if SIGKILL is not pending. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/840807a8f40bb25a8df5b6412bba6bc156643be5;"mm/oom_kill.c: suppress unnecessary ""sharing same memory"" message";yes;yes;no 163;MDY6Q29tbWl0MjMyNTI5ODo4NDA4MDdhOGY0MGJiMjVhOGRmNWI2NDEyYmJhNmJjMTU2NjQzYmU1;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: suppress unnecessary ""sharing same memory"" message oom_kill_process() sends SIGKILL to other thread groups sharing victim's mm. But printing ""Kill process %d (%s) sharing same memory\n"" lines makes no sense if they already have pending SIGKILL. This patch reduces the ""Kill process"" lines by printing that line with info level only if SIGKILL is not pending. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/840807a8f40bb25a8df5b6412bba6bc156643be5;"oom_kill_process() sends SIGKILL to other thread groups sharing victim's mm";no;no;yes 163;MDY6Q29tbWl0MjMyNTI5ODo4NDA4MDdhOGY0MGJiMjVhOGRmNWI2NDEyYmJhNmJjMTU2NjQzYmU1;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: suppress unnecessary ""sharing same memory"" message oom_kill_process() sends SIGKILL to other thread groups sharing victim's mm. But printing ""Kill process %d (%s) sharing same memory\n"" lines makes no sense if they already have pending SIGKILL. This patch reduces the ""Kill process"" lines by printing that line with info level only if SIGKILL is not pending. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/840807a8f40bb25a8df5b6412bba6bc156643be5;" But printing ""Kill process %d (%s) sharing same memory\n"" lines makes no sense if they already have pending SIGKILL";no;yes;yes 163;MDY6Q29tbWl0MjMyNTI5ODo4NDA4MDdhOGY0MGJiMjVhOGRmNWI2NDEyYmJhNmJjMTU2NjQzYmU1;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: suppress unnecessary ""sharing same memory"" message oom_kill_process() sends SIGKILL to other thread groups sharing victim's mm. But printing ""Kill process %d (%s) sharing same memory\n"" lines makes no sense if they already have pending SIGKILL. This patch reduces the ""Kill process"" lines by printing that line with info level only if SIGKILL is not pending. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/840807a8f40bb25a8df5b6412bba6bc156643be5;" This patch reduces the ""Kill process"" lines by printing that line with info level only if SIGKILL is not pending.";yes;yes;no 164;MDY6Q29tbWl0MjMyNTI5ODo4ODBiNzY4OTM3ZTkwYzQzM2MwYzgyNTRhMjJiMWViNjNkZjAwNWE0;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: fix potentially killing unrelated process At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm. If there are a lot of processes to compare or a lot of ""Kill process %d (%s) sharing same memory"" messages to print, for_each_process() loop could take very long time. It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process. When we hit such race, the unrelated process will be killed by error. To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim). [oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/880b768937e90c433c0c8254a22b1eb63df005a4;mm/oom_kill.c: fix potentially killing unrelated process;yes;yes;no 164;MDY6Q29tbWl0MjMyNTI5ODo4ODBiNzY4OTM3ZTkwYzQzM2MwYzgyNTRhMjJiMWViNjNkZjAwNWE0;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: fix potentially killing unrelated process At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm. If there are a lot of processes to compare or a lot of ""Kill process %d (%s) sharing same memory"" messages to print, for_each_process() loop could take very long time. It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process. When we hit such race, the unrelated process will be killed by error. To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim). [oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/880b768937e90c433c0c8254a22b1eb63df005a4;"At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm";no;no;yes 164;MDY6Q29tbWl0MjMyNTI5ODo4ODBiNzY4OTM3ZTkwYzQzM2MwYzgyNTRhMjJiMWViNjNkZjAwNWE0;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: fix potentially killing unrelated process At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm. If there are a lot of processes to compare or a lot of ""Kill process %d (%s) sharing same memory"" messages to print, for_each_process() loop could take very long time. It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process. When we hit such race, the unrelated process will be killed by error. To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim). [oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/880b768937e90c433c0c8254a22b1eb63df005a4;" If there are a lot of processes to compare or a lot of ""Kill process %d (%s) sharing same memory"" messages to print, for_each_process() loop could take very long time";no;yes;yes 164;MDY6Q29tbWl0MjMyNTI5ODo4ODBiNzY4OTM3ZTkwYzQzM2MwYzgyNTRhMjJiMWViNjNkZjAwNWE0;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: fix potentially killing unrelated process At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm. If there are a lot of processes to compare or a lot of ""Kill process %d (%s) sharing same memory"" messages to print, for_each_process() loop could take very long time. It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process. When we hit such race, the unrelated process will be killed by error. To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim). [oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/880b768937e90c433c0c8254a22b1eb63df005a4;"It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process";no;yes;yes 164;MDY6Q29tbWl0MjMyNTI5ODo4ODBiNzY4OTM3ZTkwYzQzM2MwYzgyNTRhMjJiMWViNjNkZjAwNWE0;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: fix potentially killing unrelated process At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm. If there are a lot of processes to compare or a lot of ""Kill process %d (%s) sharing same memory"" messages to print, for_each_process() loop could take very long time. It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process. When we hit such race, the unrelated process will be killed by error. To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim). [oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/880b768937e90c433c0c8254a22b1eb63df005a4;" When we hit such race, the unrelated process will be killed by error";no;yes;yes 164;MDY6Q29tbWl0MjMyNTI5ODo4ODBiNzY4OTM3ZTkwYzQzM2MwYzgyNTRhMjJiMWViNjNkZjAwNWE0;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: fix potentially killing unrelated process At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm. If there are a lot of processes to compare or a lot of ""Kill process %d (%s) sharing same memory"" messages to print, for_each_process() loop could take very long time. It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process. When we hit such race, the unrelated process will be killed by error. To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim). [oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/880b768937e90c433c0c8254a22b1eb63df005a4;" To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim).";yes;yes;no 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL;yes;no;no 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;"It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory";no;yes;yes 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;Before starting oom-depleter process;no;no;yes 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;" Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer";no;no;yes 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;" Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL";no;no;yes 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;" Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered";no;no;yes 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;" If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves";no;yes;yes 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;"Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim";no;yes;yes 165;MDY6Q29tbWl0MjMyNTI5ODo0MjZmYjVlNzJkOTJiODY4OTEyZTQ3YTFlM2NhMmRmNmVhYmMzODcy;Tetsuo Handa;Linus Torvalds;"mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/426fb5e72d92b868912e47a1e3ca2df6eabc3872;"Fixes: 83363b917a29 (""oom: make sure that TIF_MEMDIE is set under task_lock"")";no;yes;no 166;MDY6Q29tbWl0MjMyNTI5ODo3NWU4ZjhiMjRjYjBkYzQ5NTEyNjdkMzFmMGE0OWU1Y2UyZjM0NWM0;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary variable The ""killed"" variable in out_of_memory() can be removed since the call to oom_kill_process() where we should block to allow the process time to exit is obvious. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/75e8f8b24cb0dc4951267d31f0a49e5ce2f345c4;mm, oom: remove unnecessary variable;yes;yes;no 166;MDY6Q29tbWl0MjMyNTI5ODo3NWU4ZjhiMjRjYjBkYzQ5NTEyNjdkMzFmMGE0OWU1Y2UyZjM0NWM0;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary variable The ""killed"" variable in out_of_memory() can be removed since the call to oom_kill_process() where we should block to allow the process time to exit is obvious. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/75e8f8b24cb0dc4951267d31f0a49e5ce2f345c4;"The ""killed"" variable in out_of_memory() can be removed since the call to oom_kill_process() where we should block to allow the process time to exit is obvious.";yes;yes;yes 167;MDY6Q29tbWl0MjMyNTI5ODowNzFhNGJlZmViYjY1NWQ2YjMxYmY1YzZiYWNkNWE2ZGYwMzUyMjRk;David Rientjes;Linus Torvalds;"mm, oom: do not panic for oom kills triggered from sysrq Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/071a4befebb655d6b31bf5c6bacd5a6df035224d;mm, oom: do not panic for oom kills triggered from sysrq;yes;no;no 167;MDY6Q29tbWl0MjMyNTI5ODowNzFhNGJlZmViYjY1NWQ2YjMxYmY1YzZiYWNkNWE2ZGYwMzUyMjRk;David Rientjes;Linus Torvalds;"mm, oom: do not panic for oom kills triggered from sysrq Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/071a4befebb655d6b31bf5c6bacd5a6df035224d;"Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive";no;no;yes 167;MDY6Q29tbWl0MjMyNTI5ODowNzFhNGJlZmViYjY1NWQ2YjMxYmY1YzZiYWNkNWE2ZGYwMzUyMjRk;David Rientjes;Linus Torvalds;"mm, oom: do not panic for oom kills triggered from sysrq Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/071a4befebb655d6b31bf5c6bacd5a6df035224d;It is not intended to trigger a panic when no process may be killed;no;yes;yes 167;MDY6Q29tbWl0MjMyNTI5ODowNzFhNGJlZmViYjY1NWQ2YjMxYmY1YzZiYWNkNWE2ZGYwMzUyMjRk;David Rientjes;Linus Torvalds;"mm, oom: do not panic for oom kills triggered from sysrq Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/071a4befebb655d6b31bf5c6bacd5a6df035224d;Avoid panicking the system for sysrq+f when no processes are killed.;yes;yes;no 168;MDY6Q29tbWl0MjMyNTI5ODo1NGU5ZTI5MTMyZDdjYWVmY2FkNDcwMjgxY2FlMDZhYzM0YTk4MmM4;David Rientjes;Linus Torvalds;"mm, oom: pass an oom order of -1 when triggered by sysrq The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54e9e29132d7caefcad470281cae06ac34a982c8;mm, oom: pass an oom order of -1 when triggered by sysrq;yes;no;no 168;MDY6Q29tbWl0MjMyNTI5ODo1NGU5ZTI5MTMyZDdjYWVmY2FkNDcwMjgxY2FlMDZhYzM0YTk4MmM4;David Rientjes;Linus Torvalds;"mm, oom: pass an oom order of -1 when triggered by sysrq The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54e9e29132d7caefcad470281cae06ac34a982c8;"The force_kill member of struct oom_control isn't needed if an order of -1 is used instead";no;yes;yes 168;MDY6Q29tbWl0MjMyNTI5ODo1NGU5ZTI5MTMyZDdjYWVmY2FkNDcwMjgxY2FlMDZhYzM0YTk4MmM4;David Rientjes;Linus Torvalds;"mm, oom: pass an oom order of -1 when triggered by sysrq The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54e9e29132d7caefcad470281cae06ac34a982c8;" This is the same as order == -1 in struct compact_control which requires full memory compaction";no;no;yes 168;MDY6Q29tbWl0MjMyNTI5ODo1NGU5ZTI5MTMyZDdjYWVmY2FkNDcwMjgxY2FlMDZhYzM0YTk4MmM4;David Rientjes;Linus Torvalds;"mm, oom: pass an oom order of -1 when triggered by sysrq The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54e9e29132d7caefcad470281cae06ac34a982c8;This patch introduces no functional change.;yes;no;no 169;MDY6Q29tbWl0MjMyNTI5ODo2ZTBmYzQ2ZGMyMTUyZDNlMmQyNWE1ZDViNjQwYWUzNTg2YzI0N2M2;David Rientjes;Linus Torvalds;"mm, oom: organize oom context into struct There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e0fc46dc2152d3e2d25a5d5b640ae3586c247c6;mm, oom: organize oom context into struct;yes;yes;no 169;MDY6Q29tbWl0MjMyNTI5ODo2ZTBmYzQ2ZGMyMTUyZDNlMmQyNWE1ZDViNjQwYWUzNTg2YzI0N2M2;David Rientjes;Linus Torvalds;"mm, oom: organize oom context into struct There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e0fc46dc2152d3e2d25a5d5b640ae3586c247c6;"There are essential elements to an oom context that are passed around to multiple functions";no;yes;yes 169;MDY6Q29tbWl0MjMyNTI5ODo2ZTBmYzQ2ZGMyMTUyZDNlMmQyNWE1ZDViNjQwYWUzNTg2YzI0N2M2;David Rientjes;Linus Torvalds;"mm, oom: organize oom context into struct There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e0fc46dc2152d3e2d25a5d5b640ae3586c247c6;"Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition";yes;yes;no 169;MDY6Q29tbWl0MjMyNTI5ODo2ZTBmYzQ2ZGMyMTUyZDNlMmQyNWE1ZDViNjQwYWUzNTg2YzI0N2M2;David Rientjes;Linus Torvalds;"mm, oom: organize oom context into struct There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6e0fc46dc2152d3e2d25a5d5b640ae3586c247c6;This patch introduces no functional change.;yes;no;no 170;MDY6Q29tbWl0MjMyNTI5ODpmMGQ2NjQ3ZTg1MDUwYzZjYTcwZDY5YTY0N2UzYzY1M2RkOWIzNDlh;Wang Long;Linus Torvalds;"mm/oom_kill.c: print points as unsigned int In oom_kill_process(), the variable 'points' is unsigned int. Print it as such. Signed-off-by: Wang Long <long.wanglong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0d6647e85050c6ca70d69a647e3c653dd9b349a;mm/oom_kill.c: print points as unsigned int;yes;no;no 170;MDY6Q29tbWl0MjMyNTI5ODpmMGQ2NjQ3ZTg1MDUwYzZjYTcwZDY5YTY0N2UzYzY1M2RkOWIzNDlh;Wang Long;Linus Torvalds;"mm/oom_kill.c: print points as unsigned int In oom_kill_process(), the variable 'points' is unsigned int. Print it as such. Signed-off-by: Wang Long <long.wanglong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0d6647e85050c6ca70d69a647e3c653dd9b349a;In oom_kill_process(), the variable 'points' is unsigned int;no;no;yes 170;MDY6Q29tbWl0MjMyNTI5ODpmMGQ2NjQ3ZTg1MDUwYzZjYTcwZDY5YTY0N2UzYzY1M2RkOWIzNDlh;Wang Long;Linus Torvalds;"mm/oom_kill.c: print points as unsigned int In oom_kill_process(), the variable 'points' is unsigned int. Print it as such. Signed-off-by: Wang Long <long.wanglong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f0d6647e85050c6ca70d69a647e3c653dd9b349a;" Print it as such.";yes;yes;no 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;mm: oom_kill: simplify OOM killer locking;yes;yes;no 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;"The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things";no;no;yes 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;"The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain";no;no;yes 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;" Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel";no;no;yes 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;"The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether";no;no;yes 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;"However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills";no;yes;yes 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;" And the oom_sem is just flat-out redundant";no;yes;no 171;MDY6Q29tbWl0MjMyNTI5ODpkYzU2NDAxZmM5ZjI1ZThmOTM4OTk5OTFlYzg1OGM5OGEzMzFkODhj;Johannes Weiner;Linus Torvalds;"mm: oom_kill: simplify OOM killer locking The zonelist locking and the oom_sem are two overlapping locks that are used to serialize global OOM killing against different things. The historical zonelist locking serializes OOM kills from allocations with overlapping zonelists against each other to prevent killing more tasks than necessary in the same memory domain. Only when neither tasklists nor zonelists from two concurrent OOM kills overlap (tasks in separate memcgs bound to separate nodes) are OOM kills allowed to execute in parallel. The younger oom_sem is a read-write lock to serialize OOM killing against the PM code trying to disable the OOM killer altogether. However, the OOM killer is a fairly cold error path, there is really no reason to optimize for highly performant and concurrent OOM kills. And the oom_sem is just flat-out redundant. Replace both locking schemes with a single global mutex serializing OOM kills regardless of context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc56401fc9f25e8f93899991ec858c98a331d88c;"Replace both locking schemes with a single global mutex serializing OOM kills regardless of context.";yes;yes;no 172;MDY6Q29tbWl0MjMyNTI5ODpkYTUxYjE0YWRiNjcxODI5MDc3ZGEzYWViOWU5ZWRkNmY4YzgwYWZl;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove unnecessary locking in exit_oom_victim() Disabling the OOM killer needs to exclude allocators from entering, not existing victims from exiting. Right now the only waiter is suspend code, which achieves quiescence by disabling the OOM killer. But later on we want to add waits that hold the lock instead to stop new victims from showing up. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da51b14adb671829077da3aeb9e9edd6f8c80afe;mm: oom_kill: remove unnecessary locking in exit_oom_victim();yes;yes;no 172;MDY6Q29tbWl0MjMyNTI5ODpkYTUxYjE0YWRiNjcxODI5MDc3ZGEzYWViOWU5ZWRkNmY4YzgwYWZl;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove unnecessary locking in exit_oom_victim() Disabling the OOM killer needs to exclude allocators from entering, not existing victims from exiting. Right now the only waiter is suspend code, which achieves quiescence by disabling the OOM killer. But later on we want to add waits that hold the lock instead to stop new victims from showing up. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da51b14adb671829077da3aeb9e9edd6f8c80afe;"Disabling the OOM killer needs to exclude allocators from entering, not existing victims from exiting";no;yes;yes 172;MDY6Q29tbWl0MjMyNTI5ODpkYTUxYjE0YWRiNjcxODI5MDc3ZGEzYWViOWU5ZWRkNmY4YzgwYWZl;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove unnecessary locking in exit_oom_victim() Disabling the OOM killer needs to exclude allocators from entering, not existing victims from exiting. Right now the only waiter is suspend code, which achieves quiescence by disabling the OOM killer. But later on we want to add waits that hold the lock instead to stop new victims from showing up. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da51b14adb671829077da3aeb9e9edd6f8c80afe;"Right now the only waiter is suspend code, which achieves quiescence by disabling the OOM killer";yes;no;yes 172;MDY6Q29tbWl0MjMyNTI5ODpkYTUxYjE0YWRiNjcxODI5MDc3ZGEzYWViOWU5ZWRkNmY4YzgwYWZl;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove unnecessary locking in exit_oom_victim() Disabling the OOM killer needs to exclude allocators from entering, not existing victims from exiting. Right now the only waiter is suspend code, which achieves quiescence by disabling the OOM killer. But later on we want to add waits that hold the lock instead to stop new victims from showing up. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/da51b14adb671829077da3aeb9e9edd6f8c80afe;" But later on we want to add waits that hold the lock instead to stop new victims from showing up.";no;yes;no 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;mm: oom_kill: generalize OOM progress waitqueue;yes;yes;no 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;"It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled";no;yes;yes 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;"The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent";no;no;yes 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;However, this is quite the handgrenade;no;yes;no 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;" Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical";no;yes;yes 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;" Generally, providers shouldn't make unnecessary assumptions about consumers";no;yes;yes 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;"This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel";yes;yes;yes 173;MDY6Q29tbWl0MjMyNTI5ODpjMzhmMTAyNWYyOTEwZDYxODNlOTkyM2Q0YjRkNTgwNDQ3NGI1MGM1;Johannes Weiner;Linus Torvalds;"mm: oom_kill: generalize OOM progress waitqueue It turns out that the mechanism to wait for exiting OOM victims is less generic than it looks: it won't issue wakeups unless the OOM killer is disabled. The reason this check was added was the thought that, since only the OOM disabling code would wait on this queue, wakeup operations could be saved when that specific consumer is known to be absent. However, this is quite the handgrenade. Later attempts to reuse the waitqueue for other purposes will lead to completely unexpected bugs and the failure mode will appear seemingly illogical. Generally, providers shouldn't make unnecessary assumptions about consumers. This could have been replaced with waitqueue_active(), but it only saves a few instructions in one of the coldest paths in the kernel. Simply remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c38f1025f2910d6183e9923d4b4d5804474b50c5;" Simply remove it.";yes;no;no 174;MDY6Q29tbWl0MjMyNTI5ODo0NjQwMjc3ODVmZjYzMzQ2OGVjYmI2ODNlZGUxNDY3MmQ1MTRiM2Qz;Johannes Weiner;Linus Torvalds;"mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else can clear it concurrently. Use clear_thread_flag() directly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/464027785ff633468ecbb683ede14672d514b3d3;mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear;yes;no;no 174;MDY6Q29tbWl0MjMyNTI5ODo0NjQwMjc3ODVmZjYzMzQ2OGVjYmI2ODNlZGUxNDY3MmQ1MTRiM2Qz;Johannes Weiner;Linus Torvalds;"mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else can clear it concurrently. Use clear_thread_flag() directly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/464027785ff633468ecbb683ede14672d514b3d3;"exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else can clear it concurrently";no;yes;yes 174;MDY6Q29tbWl0MjMyNTI5ODo0NjQwMjc3ODVmZjYzMzQ2OGVjYmI2ODNlZGUxNDY3MmQ1MTRiM2Qz;Johannes Weiner;Linus Torvalds;"mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else can clear it concurrently. Use clear_thread_flag() directly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/464027785ff633468ecbb683ede14672d514b3d3; Use clear_thread_flag() directly.;yes;yes;no 175;MDY6Q29tbWl0MjMyNTI5ODoxNmU5NTE5NjZmMDVkYTVjY2Q2NTAxMDQxNzZmNmJhMjg5ZjdmYTIw;Johannes Weiner;Linus Torvalds;"mm: oom_kill: clean up victim marking and exiting interfaces Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/16e951966f05da5ccd650104176f6ba289f7fa20;mm: oom_kill: clean up victim marking and exiting interfaces;yes;yes;no 175;MDY6Q29tbWl0MjMyNTI5ODoxNmU5NTE5NjZmMDVkYTVjY2Q2NTAxMDQxNzZmNmJhMjg5ZjdmYTIw;Johannes Weiner;Linus Torvalds;"mm: oom_kill: clean up victim marking and exiting interfaces Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/16e951966f05da5ccd650104176f6ba289f7fa20;Rename unmark_oom_victim() to exit_oom_victim();yes;no;no 175;MDY6Q29tbWl0MjMyNTI5ODoxNmU5NTE5NjZmMDVkYTVjY2Q2NTAxMDQxNzZmNmJhMjg5ZjdmYTIw;Johannes Weiner;Linus Torvalds;"mm: oom_kill: clean up victim marking and exiting interfaces Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/16e951966f05da5ccd650104176f6ba289f7fa20;" Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on";no;yes;yes 175;MDY6Q29tbWl0MjMyNTI5ODoxNmU5NTE5NjZmMDVkYTVjY2Q2NTAxMDQxNzZmNmJhMjg5ZjdmYTIw;Johannes Weiner;Linus Torvalds;"mm: oom_kill: clean up victim marking and exiting interfaces Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/16e951966f05da5ccd650104176f6ba289f7fa20;This has locking implications, see follow-up changes;yes;yes;yes 175;MDY6Q29tbWl0MjMyNTI5ODoxNmU5NTE5NjZmMDVkYTVjY2Q2NTAxMDQxNzZmNmJhMjg5ZjdmYTIw;Johannes Weiner;Linus Torvalds;"mm: oom_kill: clean up victim marking and exiting interfaces Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/16e951966f05da5ccd650104176f6ba289f7fa20;"While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye.";yes;yes;no 176;MDY6Q29tbWl0MjMyNTI5ODozZjVhYjhjZmJmMTVlOGUwMjgzOGZmYzM1NDkxOTEzNTEzMDVkZjBl;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove unnecessary locking in oom_enable() Setting oom_killer_disabled to false is atomic, there is no need for further synchronization with ongoing allocations trying to OOM-kill. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f5ab8cfbf15e8e02838ffc3549191351305df0e;mm: oom_kill: remove unnecessary locking in oom_enable();yes;yes;no 176;MDY6Q29tbWl0MjMyNTI5ODozZjVhYjhjZmJmMTVlOGUwMjgzOGZmYzM1NDkxOTEzNTEzMDVkZjBl;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove unnecessary locking in oom_enable() Setting oom_killer_disabled to false is atomic, there is no need for further synchronization with ongoing allocations trying to OOM-kill. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3f5ab8cfbf15e8e02838ffc3549191351305df0e;"Setting oom_killer_disabled to false is atomic, there is no need for further synchronization with ongoing allocations trying to OOM-kill.";no;yes;yes 177;MDY6Q29tbWl0MjMyNTI5ODpiZGRkYmNkNDVmZDE5MWEwMjEzZTZkMmEwMzJlYjU1ZDE4YmQxZmMw;Yaowei Bai;Linus Torvalds;"mm/oom_kill.c: fix typo in comment Alter 'taks' -> 'task' Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bdddbcd45fd191a0213e6d2a032eb55d18bd1fc0;mm/oom_kill.c: fix typo in comment;yes;yes;no 177;MDY6Q29tbWl0MjMyNTI5ODpiZGRkYmNkNDVmZDE5MWEwMjEzZTZkMmEwMzJlYjU1ZDE4YmQxZmMw;Yaowei Bai;Linus Torvalds;"mm/oom_kill.c: fix typo in comment Alter 'taks' -> 'task' Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bdddbcd45fd191a0213e6d2a032eb55d18bd1fc0;Alter 'taks' -> 'task';yes;yes;no 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;mm: account pmd page tables to the process;yes;no;no 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;"Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup";no;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479; The trick is to allocate a lot of PMD page tables;no;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;" Linux kernel doesn't account PMD tables to the process, only PTE";no;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;"The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low";no;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479; oom_score for the process will be 0;no;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;" int main(void) addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, printf(""PID %d consumed %lu KiB in PMD page tables\n"", The patch addresses the issue by account PMD tables to the process the same way we account PTE";yes;yes;no 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;"The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range()";yes;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;But there're few corner cases;yes;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479; - HugeTLB can share PMD page tables;yes;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;"The patch handles by accounting the table to all processes who share it";yes;no;no 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479; - x86 PAE pre-allocates few PMD tables on fork;yes;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479; - Architectures with FIRST_USER_ADDRESS > 0;yes;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;"We need to adjust sanity check on exit(2)";yes;yes;no 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;"Accounting only happens on configuration where PMD page table's level is present (PMD is not folded)";yes;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479; As with nr_ptes we use per-mm counter;yes;no;yes 179;MDY6Q29tbWl0MjMyNTI5ODpkYzZjOWEzNWI2NmI1MjBjZjY3ZTA1ZDhjYTYwZWJlY2FkM2IwNDc5;Kirill A. Shutemov;Linus Torvalds;"mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror(""mmap""); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror(""re-mmap""), exit(1); } printf(""PID %d consumed %lu KiB in PMD page tables\n"", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc6c9a35b66b520cf67e05d8ca60ebecad3b0479;" The counter value is used to calculate baseline for badness score by oom-killer.";yes;no;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;oom, PM: make OOM detection in the freezer path raceless;yes;yes;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter";no;yes;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission";no;no;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"Tejun wasn't happy about this partial solution though and insisted on a full solution";no;yes;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" That requires the full OOM and freezer's task freezing exclusion, though";yes;yes;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier";yes;yes;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation";yes;yes;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations";yes;no;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims";yes;no;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;Victims are counted via mark_tsk_oom_victim resp;yes;no;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7; unmark_oom_victim;yes;no;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" The last victim wakes up all waiters enqueued by oom_killer_disable()";yes;no;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;Therefore this function acts as the full OOM barrier;yes;yes;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"The page fault path is covered now as well although it was assumed to be safe before";yes;yes;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here";yes;yes;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" Same applies to the memcg OOM killer";yes;yes;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation";yes;no;no 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" The page allocation path simply fails the allocation same as before";yes;no;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;" The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log";yes;no;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block";yes;no;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it";yes;no;yes 180;MDY6Q29tbWl0MjMyNTI5ODpjMzJiM2NiZTBkMDY3YTljZmFlODVhYTcwYmExZTk3Y2ViYTBjZWQ3;Michal Hocko;Linus Torvalds;"oom, PM: make OOM detection in the freezer path raceless Commit 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, ""We used to have freezing points deep in file system code which may be reacheable from page fault."" so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c32b3cbe0d067a9cfae85aa70ba1e97ceba0ced7;"This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet";yes;yes;yes 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;oom: thaw the OOM victim if it is frozen;yes;no;no 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;"oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim";no;no;yes 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;" This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep";no;no;yes 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;" The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock";no;no;yes 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;" But this is less than optimal";no;yes;no 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;"Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim";yes;no;no 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;" We are not checking whether the task is frozen because that would be racy and __thaw_task does that already";yes;yes;no 181;MDY6Q29tbWl0MjMyNTI5ODo2M2E4Y2E5YjIwODRmYTViZDkxYWEzODA1MzJmMThlMzYxNzY0MTA5;Michal Hocko;Linus Torvalds;"oom: thaw the OOM victim if it is frozen oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/63a8ca9b2084fa5bd91aa380532f18e361764109;" oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now.";yes;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;oom: add helpers for setting and clearing TIF_MEMDIE;yes;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend"")";yes;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;": PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen";no;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation";no;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen";no;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes";no;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset";no;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"The primary motivation was to close the race condition between OOM killer and PM freezer _completely_";no;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably";no;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" I can only speculate what might happen when a task is still runnable unexpectedly";yes;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths";yes;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;I have tested the series in KVM with 100M RAM;yes;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"- many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks";yes;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters";yes;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;#NAME?;yes;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process";yes;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled";yes;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely";yes;yes;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;The first 2 patches are simple cleanups for OOM;yes;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" They should go in regardless the rest IMO";yes;no;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto";yes;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"The main patch is the last one and I would appreciate acks from Tejun and Rafael";yes;no;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3; I think the OOM part should be OK (except for __thaw_task vs;yes;no;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code";yes;no;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;" I have found several surprises there";yes;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;This patch (of 5);yes;no;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"This patch is just a preparatory and it doesn't introduce any functional change";yes;yes;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;Note;no;no;no 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing";yes;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"This is just a side effect of the flag";yes;no;yes 182;MDY6Q29tbWl0MjMyNTI5ODo0OTU1MGI2MDU1ODc5MjRiMzMzNjM4NmNhYWU1MzIwMGM2ODk2OWQz;Michal Hocko;Linus Torvalds;"oom: add helpers for setting and clearing TIF_MEMDIE This patchset addresses a race which was described in the changelog for 5695be142e20 (""OOM, PM: OOM killed task shouldn't escape PM suspend""): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/49550b605587924b3336386caae53200c68969d3;"The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here.";yes;yes;yes 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;oom: make sure that TIF_MEMDIE is set under task_lock;yes;no;no 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;"OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much";no;no;yes 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;" The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag";no;no;yes 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;"oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space";no;yes;yes 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;" This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit";no;yes;yes 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;" The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation";no;yes;yes 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;"Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL";yes;yes;no 183;MDY6Q29tbWl0MjMyNTI5ODo4MzM2M2I5MTdhMjk4MmRkNTA5YTVlMjEyNWU5MDViNjg3MzUwNWEz;Michal Hocko;Linus Torvalds;"oom: make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/83363b917a2982dd509a5e2125e905b6873505a3;" Setting the flag for current is not necessary because check and set is not racy.";yes;yes;no 184;MDY6Q29tbWl0MjMyNTI5ODpkN2E5NGU3ZTExYmFkZjg0MDRkNDBiNDFlMDA4YzMxMzFhM2NlYmUz;Tetsuo Handa;Linus Torvalds;"oom: don't count on mm-less current process out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d7a94e7e11badf8404d40b41e008c3131a3cebe3;oom: don't count on mm-less current process;yes;no;no 184;MDY6Q29tbWl0MjMyNTI5ODpkN2E5NGU3ZTExYmFkZjg0MDRkNDBiNDFlMDA4YzMxMzFhM2NlYmUz;Tetsuo Handa;Linus Torvalds;"oom: don't count on mm-less current process out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d7a94e7e11badf8404d40b41e008c3131a3cebe3;"out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead";no;no;yes 184;MDY6Q29tbWl0MjMyNTI5ODpkN2E5NGU3ZTExYmFkZjg0MDRkNDBiNDFlMDA4YzMxMzFhM2NlYmUz;Tetsuo Handa;Linus Torvalds;"oom: don't count on mm-less current process out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d7a94e7e11badf8404d40b41e008c3131a3cebe3;" However, doing so is wrong if out_of_memory() is called by an allocation (e.g";no;yes;yes 184;MDY6Q29tbWl0MjMyNTI5ODpkN2E5NGU3ZTExYmFkZjg0MDRkNDBiNDFlMDA4YzMxMzFhM2NlYmUz;Tetsuo Handa;Linus Torvalds;"oom: don't count on mm-less current process out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d7a94e7e11badf8404d40b41e008c3131a3cebe3;"from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm()";no;yes;yes 184;MDY6Q29tbWl0MjMyNTI5ODpkN2E5NGU3ZTExYmFkZjg0MDRkNDBiNDFlMDA4YzMxMzFhM2NlYmUz;Tetsuo Handa;Linus Torvalds;"oom: don't count on mm-less current process out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d7a94e7e11badf8404d40b41e008c3131a3cebe3;" If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it";yes;no;yes 184;MDY6Q29tbWl0MjMyNTI5ODpkN2E5NGU3ZTExYmFkZjg0MDRkNDBiNDFlMDA4YzMxMzFhM2NlYmUz;Tetsuo Handa;Linus Torvalds;"oom: don't count on mm-less current process out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d7a94e7e11badf8404d40b41e008c3131a3cebe3;" It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it.";yes;yes;yes 185;MDY6Q29tbWl0MjMyNTI5ODo2YTJkNTY3OWI0YTg1MmEzYmY4MGM1NzA2NDQ0NTZhYjQ2NmFiNzE0;Oleg Nesterov;Linus Torvalds;"oom: kill the insufficient and no longer needed PT_TRACE_EXIT check After the previous patch we can remove the PT_TRACE_EXIT check in oom_scan_process_thread(), it was added to handle the case when the coredumping was ""frozen"" by ptrace, but it doesn't really work. If nothing else, we would need to check all threads which could share the same ->mm to make it more or less correct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a2d5679b4a852a3bf80c570644456ab466ab714;oom: kill the insufficient and no longer needed PT_TRACE_EXIT check;yes;yes;no 185;MDY6Q29tbWl0MjMyNTI5ODo2YTJkNTY3OWI0YTg1MmEzYmY4MGM1NzA2NDQ0NTZhYjQ2NmFiNzE0;Oleg Nesterov;Linus Torvalds;"oom: kill the insufficient and no longer needed PT_TRACE_EXIT check After the previous patch we can remove the PT_TRACE_EXIT check in oom_scan_process_thread(), it was added to handle the case when the coredumping was ""frozen"" by ptrace, but it doesn't really work. If nothing else, we would need to check all threads which could share the same ->mm to make it more or less correct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a2d5679b4a852a3bf80c570644456ab466ab714;"After the previous patch we can remove the PT_TRACE_EXIT check in oom_scan_process_thread(), it was added to handle the case when the coredumping was ""frozen"" by ptrace, but it doesn't really work";yes;yes;yes 185;MDY6Q29tbWl0MjMyNTI5ODo2YTJkNTY3OWI0YTg1MmEzYmY4MGM1NzA2NDQ0NTZhYjQ2NmFiNzE0;Oleg Nesterov;Linus Torvalds;"oom: kill the insufficient and no longer needed PT_TRACE_EXIT check After the previous patch we can remove the PT_TRACE_EXIT check in oom_scan_process_thread(), it was added to handle the case when the coredumping was ""frozen"" by ptrace, but it doesn't really work. If nothing else, we would need to check all threads which could share the same ->mm to make it more or less correct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6a2d5679b4a852a3bf80c570644456ab466ab714;" If nothing else, we would need to check all threads which could share the same ->mm to make it more or less correct.";yes;yes;yes 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;oom: don't assume that a coredumping thread will exit soon;yes;no;no 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;"oom_kill.c assumes that PF_EXITING task should exit and free the memory soon";no;no;yes 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765; This is wrong in many ways and one important case is the coredump;no;yes;yes 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;"A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory";no;yes;yes 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;"Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that";yes;yes;no 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;"Note: this is only the first step, this patch doesn't try to solve other problems";yes;no;no 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;" The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set";yes;no;yes 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;"fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it";yes;yes;yes 186;MDY6Q29tbWl0MjMyNTI5ODpkMDAzZjM3MWIyNzAxNjM1NGMzOTI0NjQ4MTk1MzBkNDdhOTE1NzY1;Oleg Nesterov;Linus Torvalds;"oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() ""forever"" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: ""Rafael J. Wysocki"" <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d003f371b27016354c392464819530d47a915765;" And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group.";yes;no;yes 187;MDY6Q29tbWl0MjMyNTI5ODoyMzE0YjQyZGI2N2JlMzBiNzQ3MTIyZDY1YzZjZDJjODVkYTM0NTM4;Johannes Weiner;Linus Torvalds;"mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree() None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2314b42db67be30b747122d65c6cd2c85da34538;mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree();yes;yes;no 187;MDY6Q29tbWl0MjMyNTI5ODoyMzE0YjQyZGI2N2JlMzBiNzQ3MTIyZDY1YzZjZDJjODVkYTM0NTM4;Johannes Weiner;Linus Torvalds;"mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree() None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2314b42db67be30b747122d65c6cd2c85da34538;"None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references";no;yes;yes 187;MDY6Q29tbWl0MjMyNTI5ODoyMzE0YjQyZGI2N2JlMzBiNzQ3MTIyZDY1YzZjZDJjODVkYTM0NTM4;Johannes Weiner;Linus Torvalds;"mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree() None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2314b42db67be30b747122d65c6cd2c85da34538; Remove it;yes;no;no 187;MDY6Q29tbWl0MjMyNTI5ODoyMzE0YjQyZGI2N2JlMzBiNzQ3MTIyZDY1YzZjZDJjODVkYTM0NTM4;Johannes Weiner;Linus Torvalds;"mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree() None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2314b42db67be30b747122d65c6cd2c85da34538;"To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant().";yes;yes;no 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;cpuset: simplify cpuset_node_allowed API;yes;yes;no 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward";no;yes;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags";no;yes;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to Such a distinction was introduced by commit 02a0e53d8227 (""cpuset";no;no;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"rework cpuset_zone_allowed api"")";no;no;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"Before, we had the only version with the __GFP_HARDWALL flag determining its behavior";no;no;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation";no;no;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"The suffixes introduced were intended to make the callers think before using the function";no;no;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;"However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary";no;yes;yes 188;MDY6Q29tbWl0MjMyNTI5ODozNDQ3MzZmMjliMzU5NzkwZmFjZDBiN2E1MjFlMzY3ZjE3MTVjMTFj;Vladimir Davydov;Tejun Heo;"cpuset: simplify cpuset_node_allowed API Current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 (""cpuset: rework cpuset_zone_allowed api""). Before, we had the only version with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from mutex to spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>";https://api.github.com/repos/torvalds/linux/git/commits/344736f29b359790facd0b7a521e367f1715c11c;So let's simplify the API back to the single check.;yes;yes;no 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;OOM, PM: OOM killed task shouldn't escape PM suspend;yes;yes;no 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen";no;no;yes 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation";no;no;yes 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"In order to protect from late wake ups OOM killer is disabled after all tasks are frozen";no;no;yes 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes";no;yes;yes 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"Reduce the race window by checking all tasks after OOM killer has been disabled";yes;yes;no 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing";yes;yes;yes 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case";yes;yes;no 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt";yes;no;yes 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code";yes;yes;no 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"A false positive will push the PM-freezer into a slow path but that is not a big deal";yes;no;yes 189;MDY6Q29tbWl0MjMyNTI5ODo1Njk1YmUxNDJlMjAzMTY3ZTNjYjUxNWVmODZhODg0MjRmMzUyNGVi;Michal Hocko;Rafael J. Wysocki;"OOM, PM: OOM killed task shouldn't escape PM suspend PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Cc: 3.2+ <stable@vger.kernel.org> # 3.2+ Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>";https://api.github.com/repos/torvalds/linux/git/commits/5695be142e203167e3cb515ef86a88424f3524eb;"Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael";yes;yes;no 191;MDY6Q29tbWl0MjMyNTI5ODpmYjc5NGJjYmI0ZTU1NTIyNDJmOWE0YzVlMWZmZTRjNmRhMjlhOTY4;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary exit_state check The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968;mm, oom: remove unnecessary exit_state check;yes;yes;no 191;MDY6Q29tbWl0MjMyNTI5ODpmYjc5NGJjYmI0ZTU1NTIyNDJmOWE0YzVlMWZmZTRjNmRhMjlhOTY4;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary exit_state check The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968;"The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing";no;no;yes 191;MDY6Q29tbWl0MjMyNTI5ODpmYjc5NGJjYmI0ZTU1NTIyNDJmOWE0YzVlMWZmZTRjNmRhMjlhOTY4;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary exit_state check The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968;" It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit";no;no;yes 191;MDY6Q29tbWl0MjMyNTI5ODpmYjc5NGJjYmI0ZTU1NTIyNDJmOWE0YzVlMWZmZTRjNmRhMjlhOTY4;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary exit_state check The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968;"Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing";no;yes;yes 191;MDY6Q29tbWl0MjMyNTI5ODpmYjc5NGJjYmI0ZTU1NTIyNDJmOWE0YzVlMWZmZTRjNmRhMjlhOTY4;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary exit_state check The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968;" That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL";no;no;yes 191;MDY6Q29tbWl0MjMyNTI5ODpmYjc5NGJjYmI0ZTU1NTIyNDJmOWE0YzVlMWZmZTRjNmRhMjlhOTY4;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary exit_state check The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968;"Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned";yes;yes;no 191;MDY6Q29tbWl0MjMyNTI5ODpmYjc5NGJjYmI0ZTU1NTIyNDJmOWE0YzVlMWZmZTRjNmRhMjlhOTY4;David Rientjes;Linus Torvalds;"mm, oom: remove unnecessary exit_state check The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968;" This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes.";no;no;yes 192;MDY6Q29tbWl0MjMyNTI5ODplOTcyYTA3MGUyZDMyOTZjZDJlMmNjMmZkMDU2MWNlODlhMWQ1ZWJm;David Rientjes;Linus Torvalds;"mm, oom: rename zonelist locking functions try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e972a070e2d3296cd2e2cc2fd0561ce89a1d5ebf;mm, oom: rename zonelist locking functions;yes;no;no 192;MDY6Q29tbWl0MjMyNTI5ODplOTcyYTA3MGUyZDMyOTZjZDJlMmNjMmZkMDU2MWNlODlhMWQ1ZWJm;David Rientjes;Linus Torvalds;"mm, oom: rename zonelist locking functions try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e972a070e2d3296cd2e2cc2fd0561ce89a1d5ebf;"try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered";no;yes;yes 192;MDY6Q29tbWl0MjMyNTI5ODplOTcyYTA3MGUyZDMyOTZjZDJlMmNjMmZkMDU2MWNlODlhMWQ1ZWJm;David Rientjes;Linus Torvalds;"mm, oom: rename zonelist locking functions try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e972a070e2d3296cd2e2cc2fd0561ce89a1d5ebf;"zone_scan_lock is required for both functions to ensure that there is proper locking synchronization";no;no;yes 192;MDY6Q29tbWl0MjMyNTI5ODplOTcyYTA3MGUyZDMyOTZjZDJlMmNjMmZkMDU2MWNlODlhMWQ1ZWJm;David Rientjes;Linus Torvalds;"mm, oom: rename zonelist locking functions try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e972a070e2d3296cd2e2cc2fd0561ce89a1d5ebf;"Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics";yes;yes;no 192;MDY6Q29tbWl0MjMyNTI5ODplOTcyYTA3MGUyZDMyOTZjZDJlMmNjMmZkMDU2MWNlODlhMWQ1ZWJm;David Rientjes;Linus Torvalds;"mm, oom: rename zonelist locking functions try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e972a070e2d3296cd2e2cc2fd0561ce89a1d5ebf;"At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested.";yes;yes;no 193;MDY6Q29tbWl0MjMyNTI5ODo4ZDA2MGJmNDkwOTMwZjMwNWM0ZWZjNDU3MjRlODYxYTI2OGY0ZDJm;David Rientjes;Linus Torvalds;"mm, oom: ensure memoryless node zonelist always includes zones With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8d060bf490930f305c4efc45724e861a268f4d2f;mm, oom: ensure memoryless node zonelist always includes zones;yes;no;no 193;MDY6Q29tbWl0MjMyNTI5ODo4ZDA2MGJmNDkwOTMwZjMwNWM0ZWZjNDU3MjRlODYxYTI2OGY0ZDJm;David Rientjes;Linus Torvalds;"mm, oom: ensure memoryless node zonelist always includes zones With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8d060bf490930f305c4efc45724e861a268f4d2f;"With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist";no;no;yes 193;MDY6Q29tbWl0MjMyNTI5ODo4ZDA2MGJmNDkwOTMwZjMwNWM0ZWZjNDU3MjRlODYxYTI2OGY0ZDJm;David Rientjes;Linus Torvalds;"mm, oom: ensure memoryless node zonelist always includes zones With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8d060bf490930f305c4efc45724e861a268f4d2f;" When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL";no;yes;yes 193;MDY6Q29tbWl0MjMyNTI5ODo4ZDA2MGJmNDkwOTMwZjMwNWM0ZWZjNDU3MjRlODYxYTI2OGY0ZDJm;David Rientjes;Linus Torvalds;"mm, oom: ensure memoryless node zonelist always includes zones With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8d060bf490930f305c4efc45724e861a268f4d2f;"The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler";no;yes;yes 193;MDY6Q29tbWl0MjMyNTI5ODo4ZDA2MGJmNDkwOTMwZjMwNWM0ZWZjNDU3MjRlODYxYTI2OGY0ZDJm;David Rientjes;Linus Torvalds;"mm, oom: ensure memoryless node zonelist always includes zones With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: ""Kirill A. Shutemov"" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8d060bf490930f305c4efc45724e861a268f4d2f;Ensure that a non-NULL zonelist is always passed to the oom killer.;yes;yes;no 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;mm, oom: base root bonus on current usage;yes;no;no 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;"A 3% of system memory bonus is sometimes too excessive in comparison to other processes";no;yes;yes 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;"With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption";no;no;yes 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;" But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes";no;yes;yes 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;" For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member";no;yes;yes 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;"The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection";no;yes;yes 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;"Replace the 3% of system memory bonus with a 3% of current memory usage bonus";yes;no;no 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;"By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small";yes;yes;yes 194;MDY6Q29tbWl0MjMyNTI5ODo3NzhjMTRhZmZhZjk0YTllNDk1MzE3OWQzZTEzYTU0NGNjY2U3NzA3;David Rientjes;Linus Torvalds;"mm, oom: base root bonus on current usage A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb (""oom: badness heuristic rewrite""), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/778c14affaf94a9e4953179d3e13a544ccce7707;" In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list.";yes;yes;yes 195;MDY6Q29tbWl0MjMyNTI5ODpkNDlhZDkzNTU0MjBjNzQzYzczNmJmZDFkZWU5ZWFhNWIxYTc3MjJh;David Rientjes;Linus Torvalds;"mm, oom: prefer thread group leaders for display purposes When two threads have the same badness score, it's preferable to kill the thread group leader so that the actual process name is printed to the kernel log rather than the thread group name which may be shared amongst several processes. This was the behavior when select_bad_process() used to do for_each_process(), but it now iterates threads instead and leads to ambiguity. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d49ad9355420c743c736bfd1dee9eaa5b1a7722a;mm, oom: prefer thread group leaders for display purposes;yes;yes;no 195;MDY6Q29tbWl0MjMyNTI5ODpkNDlhZDkzNTU0MjBjNzQzYzczNmJmZDFkZWU5ZWFhNWIxYTc3MjJh;David Rientjes;Linus Torvalds;"mm, oom: prefer thread group leaders for display purposes When two threads have the same badness score, it's preferable to kill the thread group leader so that the actual process name is printed to the kernel log rather than the thread group name which may be shared amongst several processes. This was the behavior when select_bad_process() used to do for_each_process(), but it now iterates threads instead and leads to ambiguity. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d49ad9355420c743c736bfd1dee9eaa5b1a7722a;"When two threads have the same badness score, it's preferable to kill the thread group leader so that the actual process name is printed to the kernel log rather than the thread group name which may be shared amongst several processes";yes;yes;yes 195;MDY6Q29tbWl0MjMyNTI5ODpkNDlhZDkzNTU0MjBjNzQzYzczNmJmZDFkZWU5ZWFhNWIxYTc3MjJh;David Rientjes;Linus Torvalds;"mm, oom: prefer thread group leaders for display purposes When two threads have the same badness score, it's preferable to kill the thread group leader so that the actual process name is printed to the kernel log rather than the thread group name which may be shared amongst several processes. This was the behavior when select_bad_process() used to do for_each_process(), but it now iterates threads instead and leads to ambiguity. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d49ad9355420c743c736bfd1dee9eaa5b1a7722a;"This was the behavior when select_bad_process() used to do for_each_process(), but it now iterates threads instead and leads to ambiguity.";yes;yes;yes 196;MDY6Q29tbWl0MjMyNTI5ODo0ZDQwNDhiZThhOTM3NjkzNTBlZmEzMWQyNDgyYTAzOGI3ZGU3M2Qw;Oleg Nesterov;Linus Torvalds;"oom_kill: add rcu_read_lock() into find_lock_task_mm() find_lock_task_mm() expects it is called under rcu or tasklist lock, but it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and mem_cgroup_out_of_memory()->oom_badness() can call it lockless. Perhaps we could fix the callers, but this patch simply adds rcu lock into find_lock_task_mm(). This also allows to simplify a bit one of its callers, oom_kill_process(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Sergey Dyasly <dserrg@gmail.com> Cc: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4048be8a93769350efa31d2482a038b7de73d0;oom_kill: add rcu_read_lock() into find_lock_task_mm();yes;no;no 196;MDY6Q29tbWl0MjMyNTI5ODo0ZDQwNDhiZThhOTM3NjkzNTBlZmEzMWQyNDgyYTAzOGI3ZGU3M2Qw;Oleg Nesterov;Linus Torvalds;"oom_kill: add rcu_read_lock() into find_lock_task_mm() find_lock_task_mm() expects it is called under rcu or tasklist lock, but it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and mem_cgroup_out_of_memory()->oom_badness() can call it lockless. Perhaps we could fix the callers, but this patch simply adds rcu lock into find_lock_task_mm(). This also allows to simplify a bit one of its callers, oom_kill_process(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Sergey Dyasly <dserrg@gmail.com> Cc: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4048be8a93769350efa31d2482a038b7de73d0;"find_lock_task_mm() expects it is called under rcu or tasklist lock, but it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and mem_cgroup_out_of_memory()->oom_badness() can call it lockless";no;yes;yes 196;MDY6Q29tbWl0MjMyNTI5ODo0ZDQwNDhiZThhOTM3NjkzNTBlZmEzMWQyNDgyYTAzOGI3ZGU3M2Qw;Oleg Nesterov;Linus Torvalds;"oom_kill: add rcu_read_lock() into find_lock_task_mm() find_lock_task_mm() expects it is called under rcu or tasklist lock, but it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and mem_cgroup_out_of_memory()->oom_badness() can call it lockless. Perhaps we could fix the callers, but this patch simply adds rcu lock into find_lock_task_mm(). This also allows to simplify a bit one of its callers, oom_kill_process(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Sergey Dyasly <dserrg@gmail.com> Cc: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4048be8a93769350efa31d2482a038b7de73d0;"Perhaps we could fix the callers, but this patch simply adds rcu lock into find_lock_task_mm()";yes;no;yes 196;MDY6Q29tbWl0MjMyNTI5ODo0ZDQwNDhiZThhOTM3NjkzNTBlZmEzMWQyNDgyYTAzOGI3ZGU3M2Qw;Oleg Nesterov;Linus Torvalds;"oom_kill: add rcu_read_lock() into find_lock_task_mm() find_lock_task_mm() expects it is called under rcu or tasklist lock, but it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and mem_cgroup_out_of_memory()->oom_badness() can call it lockless. Perhaps we could fix the callers, but this patch simply adds rcu lock into find_lock_task_mm(). This also allows to simplify a bit one of its callers, oom_kill_process(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Sergey Dyasly <dserrg@gmail.com> Cc: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d4048be8a93769350efa31d2482a038b7de73d0;" This also allows to simplify a bit one of its callers, oom_kill_process().";yes;yes;no 197;MDY6Q29tbWl0MjMyNTI5ODphZDk2MjQ0MTc5ZmJkNTViNDBjMDBmMTBmMzk5YmMwNDczOWI4ZTFm;Oleg Nesterov;Linus Torvalds;"oom_kill: has_intersects_mems_allowed() needs rcu_read_lock() At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need ""bool ret"" and ""break"". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad96244179fbd55b40c00f10f399bc04739b8e1f;oom_kill: has_intersects_mems_allowed() needs rcu_read_lock();yes;yes;yes 197;MDY6Q29tbWl0MjMyNTI5ODphZDk2MjQ0MTc5ZmJkNTViNDBjMDBmMTBmMzk5YmMwNDczOWI4ZTFm;Oleg Nesterov;Linus Torvalds;"oom_kill: has_intersects_mems_allowed() needs rcu_read_lock() At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need ""bool ret"" and ""break"". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad96244179fbd55b40c00f10f399bc04739b8e1f;"At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy";no;yes;yes 197;MDY6Q29tbWl0MjMyNTI5ODphZDk2MjQ0MTc5ZmJkNTViNDBjMDBmMTBmMzk5YmMwNDczOWI4ZTFm;Oleg Nesterov;Linus Torvalds;"oom_kill: has_intersects_mems_allowed() needs rcu_read_lock() At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need ""bool ret"" and ""break"". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad96244179fbd55b40c00f10f399bc04739b8e1f;Add the necessary rcu_read_lock();yes;yes;no 197;MDY6Q29tbWl0MjMyNTI5ODphZDk2MjQ0MTc5ZmJkNTViNDBjMDBmMTBmMzk5YmMwNDczOWI4ZTFm;Oleg Nesterov;Linus Torvalds;"oom_kill: has_intersects_mems_allowed() needs rcu_read_lock() At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need ""bool ret"" and ""break"". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad96244179fbd55b40c00f10f399bc04739b8e1f;" This means that we can not simply return from the loop, we need ""bool ret"" and ""break""";yes;yes;yes 197;MDY6Q29tbWl0MjMyNTI5ODphZDk2MjQ0MTc5ZmJkNTViNDBjMDBmMTBmMzk5YmMwNDczOWI4ZTFm;Oleg Nesterov;Linus Torvalds;"oom_kill: has_intersects_mems_allowed() needs rcu_read_lock() At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need ""bool ret"" and ""break"". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad96244179fbd55b40c00f10f399bc04739b8e1f;"While at it, swap the names of task_struct's (the argument and the local)";yes;no;no 197;MDY6Q29tbWl0MjMyNTI5ODphZDk2MjQ0MTc5ZmJkNTViNDBjMDBmMTBmMzk5YmMwNDczOWI4ZTFm;Oleg Nesterov;Linus Torvalds;"oom_kill: has_intersects_mems_allowed() needs rcu_read_lock() At least out_of_memory() calls has_intersects_mems_allowed() without even rcu_read_lock(), this is obviously buggy. Add the necessary rcu_read_lock(). This means that we can not simply return from the loop, we need ""bool ret"" and ""break"". While at it, swap the names of task_struct's (the argument and the local). This cleans up the code a little bit and avoids the unnecessary initialization. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad96244179fbd55b40c00f10f399bc04739b8e1f;" This cleans up the code a little bit and avoids the unnecessary initialization.";yes;yes;no 198;MDY6Q29tbWl0MjMyNTI5ODoxZGE0ZGIwY2Q1YzhhMzFkNDQ2OGVjOTA2YjQxM2U3NWU2MDRiNDY1;Oleg Nesterov;Linus Torvalds;"oom_kill: change oom_kill.c to use for_each_thread() Change oom_kill.c to use for_each_thread() rather than the racy while_each_thread() which can loop forever if we race with exit. Note also that most users were buggy even if while_each_thread() was fine, the task can exit even _before_ rcu_read_lock(). Fortunately the new for_each_thread() only requires the stable task_struct, so this change fixes both problems. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1da4db0cd5c8a31d4468ec906b413e75e604b465;oom_kill: change oom_kill.c to use for_each_thread();yes;no;no 198;MDY6Q29tbWl0MjMyNTI5ODoxZGE0ZGIwY2Q1YzhhMzFkNDQ2OGVjOTA2YjQxM2U3NWU2MDRiNDY1;Oleg Nesterov;Linus Torvalds;"oom_kill: change oom_kill.c to use for_each_thread() Change oom_kill.c to use for_each_thread() rather than the racy while_each_thread() which can loop forever if we race with exit. Note also that most users were buggy even if while_each_thread() was fine, the task can exit even _before_ rcu_read_lock(). Fortunately the new for_each_thread() only requires the stable task_struct, so this change fixes both problems. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1da4db0cd5c8a31d4468ec906b413e75e604b465;"Change oom_kill.c to use for_each_thread() rather than the racy while_each_thread() which can loop forever if we race with exit";yes;yes;yes 198;MDY6Q29tbWl0MjMyNTI5ODoxZGE0ZGIwY2Q1YzhhMzFkNDQ2OGVjOTA2YjQxM2U3NWU2MDRiNDY1;Oleg Nesterov;Linus Torvalds;"oom_kill: change oom_kill.c to use for_each_thread() Change oom_kill.c to use for_each_thread() rather than the racy while_each_thread() which can loop forever if we race with exit. Note also that most users were buggy even if while_each_thread() was fine, the task can exit even _before_ rcu_read_lock(). Fortunately the new for_each_thread() only requires the stable task_struct, so this change fixes both problems. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1da4db0cd5c8a31d4468ec906b413e75e604b465;"Note also that most users were buggy even if while_each_thread() was fine, the task can exit even _before_ rcu_read_lock()";no;yes;yes 198;MDY6Q29tbWl0MjMyNTI5ODoxZGE0ZGIwY2Q1YzhhMzFkNDQ2OGVjOTA2YjQxM2U3NWU2MDRiNDY1;Oleg Nesterov;Linus Torvalds;"oom_kill: change oom_kill.c to use for_each_thread() Change oom_kill.c to use for_each_thread() rather than the racy while_each_thread() which can loop forever if we race with exit. Note also that most users were buggy even if while_each_thread() was fine, the task can exit even _before_ rcu_read_lock(). Fortunately the new for_each_thread() only requires the stable task_struct, so this change fixes both problems. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: ""Ma, Xindong"" <xindong.ma@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: ""Tu, Xiaobing"" <xiaobing.tu@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1da4db0cd5c8a31d4468ec906b413e75e604b465;"Fortunately the new for_each_thread() only requires the stable task_struct, so this change fixes both problems.";yes;yes;yes 199;MDY6Q29tbWl0MjMyNTI5ODplMWY1NmM4OWIwNDAxMzRhZGQ5M2Y2ODY5MzFjYzI2NjU0MWQyMzlh;Kirill A. Shutemov;Linus Torvalds;"mm: convert mm->nr_ptes to atomic_long_t With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes. Let's convert it to atomic_long_t to avoid races. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: ""Eric W . Biederman"" <ebiederm@xmission.com> Cc: ""Paul E . McKenney"" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1f56c89b040134add93f686931cc266541d239a;mm: convert mm->nr_ptes to atomic_long_t;yes;no;no 199;MDY6Q29tbWl0MjMyNTI5ODplMWY1NmM4OWIwNDAxMzRhZGQ5M2Y2ODY5MzFjYzI2NjU0MWQyMzlh;Kirill A. Shutemov;Linus Torvalds;"mm: convert mm->nr_ptes to atomic_long_t With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes. Let's convert it to atomic_long_t to avoid races. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: ""Eric W . Biederman"" <ebiederm@xmission.com> Cc: ""Paul E . McKenney"" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1f56c89b040134add93f686931cc266541d239a;"With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes";no;yes;yes 199;MDY6Q29tbWl0MjMyNTI5ODplMWY1NmM4OWIwNDAxMzRhZGQ5M2Y2ODY5MzFjYzI2NjU0MWQyMzlh;Kirill A. Shutemov;Linus Torvalds;"mm: convert mm->nr_ptes to atomic_long_t With split page table lock for PMD level we can't hold mm->page_table_lock while updating nr_ptes. Let's convert it to atomic_long_t to avoid races. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: ""Eric W . Biederman"" <ebiederm@xmission.com> Cc: ""Paul E . McKenney"" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1f56c89b040134add93f686931cc266541d239a;Let's convert it to atomic_long_t to avoid races.;yes;yes;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;mm: memcg: do not trap chargers with full callstack on OOM;yes;no;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;The memcg OOM handling is incredibly fragile and can deadlock;no;yes;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds";no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved";no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit";no;yes;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;"For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex";no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex";no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;OOM invoking task;no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources";no;yes;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;"A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations";no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;"In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit";no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks";no;yes;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks";no;no;yes 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;"This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held";yes;yes;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;"When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt";yes;no;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" This way, the OOM victim can not get stuck on locks the looping task may hold";yes;yes;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;"When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context";yes;no;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM";yes;no;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault";yes;no;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;" The OOM victim can no longer get stuck on any lock a sleeping task may hold";yes;yes;no 201;MDY6Q29tbWl0MjMyNTI5ODozODEyYzhjOGYzOTUzOTIxZWYxODU0NDExMGRhZmMzNTA1YzFhYzYy;Johannes Weiner;Linus Torvalds;"mm: memcg: do not trap chargers with full callstack on OOM The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3812c8c8f3953921ef18544110dafc3505c1ac62;Debugged by Michal Hocko.;no;no;no 202;MDY6Q29tbWl0MjMyNTI5ODo2YjRmMmI1NmE0OGM4ZWE5Nzc1YmQyYjI5NjgxNzI1ZDQ0NzQzNjdh;Rusty Russell;Rusty Russell;"mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR(). The normal expectation for ERR_PTR() is to put a negative errno into a pointer. oom_kill puts the magic -1 in the result (and has since pre-git), which is probably clearer with an explicit cast. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>";https://api.github.com/repos/torvalds/linux/git/commits/6b4f2b56a48c8ea9775bd2b29681725d4474367a;mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().;yes;yes;no 202;MDY6Q29tbWl0MjMyNTI5ODo2YjRmMmI1NmE0OGM4ZWE5Nzc1YmQyYjI5NjgxNzI1ZDQ0NzQzNjdh;Rusty Russell;Rusty Russell;"mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR(). The normal expectation for ERR_PTR() is to put a negative errno into a pointer. oom_kill puts the magic -1 in the result (and has since pre-git), which is probably clearer with an explicit cast. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>";https://api.github.com/repos/torvalds/linux/git/commits/6b4f2b56a48c8ea9775bd2b29681725d4474367a;"The normal expectation for ERR_PTR() is to put a negative errno into a pointer";no;no;yes 202;MDY6Q29tbWl0MjMyNTI5ODo2YjRmMmI1NmE0OGM4ZWE5Nzc1YmQyYjI5NjgxNzI1ZDQ0NzQzNjdh;Rusty Russell;Rusty Russell;"mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR(). The normal expectation for ERR_PTR() is to put a negative errno into a pointer. oom_kill puts the magic -1 in the result (and has since pre-git), which is probably clearer with an explicit cast. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>";https://api.github.com/repos/torvalds/linux/git/commits/6b4f2b56a48c8ea9775bd2b29681725d4474367a;" oom_kill puts the magic -1 in the result (and has since pre-git), which is probably clearer with an explicit cast.";yes;yes;yes 203;MDY6Q29tbWl0MjMyNTI5ODo1OGNmMTg4ZWQ2NDliNjU3MGRmZGM5YzYyMTU2Y2RmMzk2YzJlMzk1;Sha Zhengju;Linus Torvalds;"memcg, oom: provide more precise dump info while memcg oom happening Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in ""Memory cgroup stats for XXX"" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/58cf188ed649b6570dfdc9c62156cdf396c2e395;memcg, oom: provide more precise dump info while memcg oom happening;yes;yes;no 203;MDY6Q29tbWl0MjMyNTI5ODo1OGNmMTg4ZWQ2NDliNjU3MGRmZGM5YzYyMTU2Y2RmMzk2YzJlMzk1;Sha Zhengju;Linus Torvalds;"memcg, oom: provide more precise dump info while memcg oom happening Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in ""Memory cgroup stats for XXX"" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/58cf188ed649b6570dfdc9c62156cdf396c2e395;"Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users";no;no;yes 203;MDY6Q29tbWl0MjMyNTI5ODo1OGNmMTg4ZWQ2NDliNjU3MGRmZGM5YzYyMTU2Y2RmMzk2YzJlMzk1;Sha Zhengju;Linus Torvalds;"memcg, oom: provide more precise dump info while memcg oom happening Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in ""Memory cgroup stats for XXX"" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/58cf188ed649b6570dfdc9c62156cdf396c2e395;" This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration";yes;yes;no 203;MDY6Q29tbWl0MjMyNTI5ODo1OGNmMTg4ZWQ2NDliNjU3MGRmZGM5YzYyMTU2Y2RmMzk2YzJlMzk1;Sha Zhengju;Linus Torvalds;"memcg, oom: provide more precise dump info while memcg oom happening Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in ""Memory cgroup stats for XXX"" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/58cf188ed649b6570dfdc9c62156cdf396c2e395;"Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg A (use_hierachy=1) then the printed info will be";yes;no;yes 203;MDY6Q29tbWl0MjMyNTI5ODo1OGNmMTg4ZWQ2NDliNjU3MGRmZGM5YzYyMTU2Y2RmMzk2YzJlMzk1;Sha Zhengju;Linus Torvalds;"memcg, oom: provide more precise dump info while memcg oom happening Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in ""Memory cgroup stats for XXX"" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/58cf188ed649b6570dfdc9c62156cdf396c2e395;Following are samples of oom output;yes;no;yes 203;MDY6Q29tbWl0MjMyNTI5ODo1OGNmMTg4ZWQ2NDliNjU3MGRmZGM5YzYyMTU2Y2RmMzk2YzJlMzk1;Sha Zhengju;Linus Torvalds;"memcg, oom: provide more precise dump info while memcg oom happening Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in ""Memory cgroup stats for XXX"" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/58cf188ed649b6570dfdc9c62156cdf396c2e395;(1) Before change;no;no;no 203;MDY6Q29tbWl0MjMyNTI5ODo1OGNmMTg4ZWQ2NDliNjU3MGRmZGM5YzYyMTU2Y2RmMzk2YzJlMzk1;Sha Zhengju;Linus Torvalds;"memcg, oom: provide more precise dump info while memcg oom happening Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in ""Memory cgroup stats for XXX"" section. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/58cf188ed649b6570dfdc9c62156cdf396c2e395;" mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom";no;yes;yes 204;MDY6Q29tbWl0MjMyNTI5ODowZmE4NGE0YmZhMmFhYzhjMDRkNDUzNTFiNDA3NjVkNjFlMWZkMjBk;David Rientjes;Linus Torvalds;"mm, oom: remove redundant sleep in pagefault oom handler out_of_memory() will already cause current to schedule if it has not been killed, so doing it again in pagefault_out_of_memory() is redundant. Remove it. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0fa84a4bfa2aac8c04d45351b40765d61e1fd20d;mm, oom: remove redundant sleep in pagefault oom handler;yes;yes;no 204;MDY6Q29tbWl0MjMyNTI5ODowZmE4NGE0YmZhMmFhYzhjMDRkNDUzNTFiNDA3NjVkNjFlMWZkMjBk;David Rientjes;Linus Torvalds;"mm, oom: remove redundant sleep in pagefault oom handler out_of_memory() will already cause current to schedule if it has not been killed, so doing it again in pagefault_out_of_memory() is redundant. Remove it. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0fa84a4bfa2aac8c04d45351b40765d61e1fd20d;"out_of_memory() will already cause current to schedule if it has not been killed, so doing it again in pagefault_out_of_memory() is redundant";no;yes;yes 204;MDY6Q29tbWl0MjMyNTI5ODowZmE4NGE0YmZhMmFhYzhjMDRkNDUzNTFiNDA3NjVkNjFlMWZkMjBk;David Rientjes;Linus Torvalds;"mm, oom: remove redundant sleep in pagefault oom handler out_of_memory() will already cause current to schedule if it has not been killed, so doing it again in pagefault_out_of_memory() is redundant. Remove it. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0fa84a4bfa2aac8c04d45351b40765d61e1fd20d;Remove it.;yes;no;no 205;MDY6Q29tbWl0MjMyNTI5ODplZmFjZDAyZTRmNTdkOTRlOTM0YmE1Yzg0ZjEwZjhjZTkxMTU4Nzcw;David Rientjes;Linus Torvalds;"mm, oom: cleanup pagefault oom handler To lock the entire system from parallel oom killing, it's possible to pass in a zonelist with all zones rather than using for_each_populated_zone() for the iteration. This obsoletes try_set_system_oom() and clear_system_oom() so that they can be removed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/efacd02e4f57d94e934ba5c84f10f8ce91158770;mm, oom: cleanup pagefault oom handler;yes;yes;no 205;MDY6Q29tbWl0MjMyNTI5ODplZmFjZDAyZTRmNTdkOTRlOTM0YmE1Yzg0ZjEwZjhjZTkxMTU4Nzcw;David Rientjes;Linus Torvalds;"mm, oom: cleanup pagefault oom handler To lock the entire system from parallel oom killing, it's possible to pass in a zonelist with all zones rather than using for_each_populated_zone() for the iteration. This obsoletes try_set_system_oom() and clear_system_oom() so that they can be removed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/efacd02e4f57d94e934ba5c84f10f8ce91158770;"To lock the entire system from parallel oom killing, it's possible to pass in a zonelist with all zones rather than using for_each_populated_zone() for the iteration";yes;no;yes 205;MDY6Q29tbWl0MjMyNTI5ODplZmFjZDAyZTRmNTdkOTRlOTM0YmE1Yzg0ZjEwZjhjZTkxMTU4Nzcw;David Rientjes;Linus Torvalds;"mm, oom: cleanup pagefault oom handler To lock the entire system from parallel oom killing, it's possible to pass in a zonelist with all zones rather than using for_each_populated_zone() for the iteration. This obsoletes try_set_system_oom() and clear_system_oom() so that they can be removed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/efacd02e4f57d94e934ba5c84f10f8ce91158770;" This obsoletes try_set_system_oom() and clear_system_oom() so that they can be removed.";yes;yes;yes 206;MDY6Q29tbWl0MjMyNTI5ODpiZDNhNjZjMWNkZjMxMjc0NDg5Y2MxYjVhY2U4Nzk2OTVhNWExNzk3;Lai Jiangshan;Linus Torvalds;"oom: use N_MEMORY instead N_HIGH_MEMORY N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bd3a66c1cdf31274489cc1b5ace879695a5a1797;oom: use N_MEMORY instead N_HIGH_MEMORY;yes;no;no 206;MDY6Q29tbWl0MjMyNTI5ODpiZDNhNjZjMWNkZjMxMjc0NDg5Y2MxYjVhY2U4Nzk2OTVhNWExNzk3;Lai Jiangshan;Linus Torvalds;"oom: use N_MEMORY instead N_HIGH_MEMORY N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bd3a66c1cdf31274489cc1b5ace879695a5a1797;N_HIGH_MEMORY stands for the nodes that has normal or high memory;no;no;yes 206;MDY6Q29tbWl0MjMyNTI5ODpiZDNhNjZjMWNkZjMxMjc0NDg5Y2MxYjVhY2U4Nzk2OTVhNWExNzk3;Lai Jiangshan;Linus Torvalds;"oom: use N_MEMORY instead N_HIGH_MEMORY N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bd3a66c1cdf31274489cc1b5ace879695a5a1797;N_MEMORY stands for the nodes that has any memory;no;no;yes 206;MDY6Q29tbWl0MjMyNTI5ODpiZDNhNjZjMWNkZjMxMjc0NDg5Y2MxYjVhY2U4Nzk2OTVhNWExNzk3;Lai Jiangshan;Linus Torvalds;"oom: use N_MEMORY instead N_HIGH_MEMORY N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bd3a66c1cdf31274489cc1b5ace879695a5a1797;"The code here need to handle with the nodes which have memory, we should use N_MEMORY instead.";yes;yes;no 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;mm, oom: fix race when specifying a thread as the oom origin;yes;yes;no 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;"test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls";no;no;yes 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;"The usage is to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same";no;no;yes 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;"This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls";no;yes;yes 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;"The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX";no;yes;yes 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;"To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags";yes;yes;no 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;" KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj";yes;yes;no 207;MDY6Q29tbWl0MjMyNTI5ODplMWUxMmQyZjMxMDRiZTg4NjA3M2FjNmM1YzQ2NzhmMzBiMWI5ZTUx;David Rientjes;Linus Torvalds;"mm, oom: fix race when specifying a thread as the oom origin test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e1e12d2f3104be886073ac6c5c4678f30b1b9e51;"This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score.";yes;yes;no 208;MDY6Q29tbWl0MjMyNTI5ODphOWM1OGI5MDdkYmM2ODIxNTMzZGZjMjk1YjYzY2FmMTExZmYxZjE2;David Rientjes;Linus Torvalds;"mm, oom: change type of oom_score_adj to short The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000, so this range can be represented by the signed short type with no functional change. The extra space this frees up in struct signal_struct will be used for per-thread oom kill flags in the next patch. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a9c58b907dbc6821533dfc295b63caf111ff1f16;mm, oom: change type of oom_score_adj to short;yes;no;no 208;MDY6Q29tbWl0MjMyNTI5ODphOWM1OGI5MDdkYmM2ODIxNTMzZGZjMjk1YjYzY2FmMTExZmYxZjE2;David Rientjes;Linus Torvalds;"mm, oom: change type of oom_score_adj to short The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000, so this range can be represented by the signed short type with no functional change. The extra space this frees up in struct signal_struct will be used for per-thread oom kill flags in the next patch. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a9c58b907dbc6821533dfc295b63caf111ff1f16;"The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000, so this range can be represented by the signed short type with no functional change";yes;yes;yes 208;MDY6Q29tbWl0MjMyNTI5ODphOWM1OGI5MDdkYmM2ODIxNTMzZGZjMjk1YjYzY2FmMTExZmYxZjE2;David Rientjes;Linus Torvalds;"mm, oom: change type of oom_score_adj to short The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000, so this range can be represented by the signed short type with no functional change. The extra space this frees up in struct signal_struct will be used for per-thread oom kill flags in the next patch. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a9c58b907dbc6821533dfc295b63caf111ff1f16;" The extra space this frees up in struct signal_struct will be used for per-thread oom kill flags in the next patch.";yes;yes;no 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;mm, oom: allow exiting threads to have access to memory reserves;yes;no;no 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;"Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress";no;no;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;" This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory";no;no;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;"These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace";no;no;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;" The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point";no;no;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;" This prevents needlessly killing processes when others are already exiting";no;yes;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;"Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer";yes;yes;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;" This allows them to quickly allocate, detach its mm, and free the memory it represents";yes;yes;no 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;Summary of Luigi's bug report;no;no;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;": He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads";no;no;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;" This can happen anytime we need : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace";no;no;yes 209;MDY6Q29tbWl0MjMyNTI5ODo5ZmY0ODY4ZTMwNTFkOTEyOGEyNGRkMzMwYmVkMzIwMTFhMTE0MjFk;David Rientjes;Linus Torvalds;"mm, oom: allow exiting threads to have access to memory reserves Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach its mm, and free the memory it represents. Summary of Luigi's bug report: : He had an oom condition where threads were faulting on task->robust_list : and repeatedly called the oom killer but it would defer killing a thread : because it saw other PF_EXITING threads. This can happen anytime we need : to allocate memory after setting PF_EXITING and before detaching our mm; : if there are other threads in the same state then the oom killer won't do : anything unless one of them happens to be killed from userspace. : : So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!). Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9ff4868e3051d9128a24dd330bed32011a11421d;": So instead of only deferring for PF_EXITING and !task->robust_list, it's : better to just give them access to memory reserves to prevent a potential : livelock so that any other faults that may be introduced in the future in : the exit path don't cause the same problem (and hopefully we don't allow : too many of those!).";yes;yes;yes 210;MDY6Q29tbWl0MjMyNTI5ODowMWRjNTJlYmRmNDcyZjc3Y2NhNjIzY2E2OTNjYTI0Y2ZjMGYxYmJl;Davidlohr Bueso;Linus Torvalds;"oom: remove deprecated oom_adj The deprecated /proc/<pid>/oom_adj is scheduled for removal this month. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/01dc52ebdf472f77cca623ca693ca24cfc0f1bbe;oom: remove deprecated oom_adj;yes;yes;no 210;MDY6Q29tbWl0MjMyNTI5ODowMWRjNTJlYmRmNDcyZjc3Y2NhNjIzY2E2OTNjYTI0Y2ZjMGYxYmJl;Davidlohr Bueso;Linus Torvalds;"oom: remove deprecated oom_adj The deprecated /proc/<pid>/oom_adj is scheduled for removal this month. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/01dc52ebdf472f77cca623ca693ca24cfc0f1bbe;The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.;yes;yes;yes 211;MDY6Q29tbWl0MjMyNTI5ODo4NzZhYWZiZmQ5YmE1YmIzNTJmMWIxNDYyMmMyN2YzZmU5YTk5MDEz;David Rientjes;Linus Torvalds;"mm, memcg: move all oom handling to memcontrol.c By globally defining check_panic_on_oom(), the memcg oom handler can be moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the oom killer and cleans up the code. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/876aafbfd9ba5bb352f1b14622c27f3fe9a99013;mm, memcg: move all oom handling to memcontrol.c;yes;no;no 211;MDY6Q29tbWl0MjMyNTI5ODo4NzZhYWZiZmQ5YmE1YmIzNTJmMWIxNDYyMmMyN2YzZmU5YTk5MDEz;David Rientjes;Linus Torvalds;"mm, memcg: move all oom handling to memcontrol.c By globally defining check_panic_on_oom(), the memcg oom handler can be moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the oom killer and cleans up the code. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/876aafbfd9ba5bb352f1b14622c27f3fe9a99013;"By globally defining check_panic_on_oom(), the memcg oom handler can be moved entirely to mm/memcontrol.c";yes;yes;no 211;MDY6Q29tbWl0MjMyNTI5ODo4NzZhYWZiZmQ5YmE1YmIzNTJmMWIxNDYyMmMyN2YzZmU5YTk5MDEz;David Rientjes;Linus Torvalds;"mm, memcg: move all oom handling to memcontrol.c By globally defining check_panic_on_oom(), the memcg oom handler can be moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the oom killer and cleans up the code. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/876aafbfd9ba5bb352f1b14622c27f3fe9a99013;" This removes the ugly #ifdef in the oom killer and cleans up the code.";yes;yes;no 212;MDY6Q29tbWl0MjMyNTI5ODo2YjBjODFiM2JlMTE0YTkzZjc5YmQ0YzU2MzlhZGU1MTA3ZDc3YzIx;David Rientjes;Linus Torvalds;"mm, oom: reduce dependency on tasklist_lock Since exiting tasks require write_lock_irq(&tasklist_lock) several times, try to reduce the amount of time the readside is held for oom kills. This makes the interface with the memcg oom handler more consistent since it now never needs to take tasklist_lock unnecessarily. The only time the oom killer now takes tasklist_lock is when iterating the children of the selected task, everything else is protected by rcu_read_lock(). This requires that a reference to the selected process, p, is grabbed before calling oom_kill_process(). It may release it and grab a reference on another one of p's threads if !p->mm, but it also guarantees that it will release the reference before returning. [hughd@google.com: fix duplicate put_task_struct()] Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6b0c81b3be114a93f79bd4c5639ade5107d77c21;mm, oom: reduce dependency on tasklist_lock;yes;yes;no 212;MDY6Q29tbWl0MjMyNTI5ODo2YjBjODFiM2JlMTE0YTkzZjc5YmQ0YzU2MzlhZGU1MTA3ZDc3YzIx;David Rientjes;Linus Torvalds;"mm, oom: reduce dependency on tasklist_lock Since exiting tasks require write_lock_irq(&tasklist_lock) several times, try to reduce the amount of time the readside is held for oom kills. This makes the interface with the memcg oom handler more consistent since it now never needs to take tasklist_lock unnecessarily. The only time the oom killer now takes tasklist_lock is when iterating the children of the selected task, everything else is protected by rcu_read_lock(). This requires that a reference to the selected process, p, is grabbed before calling oom_kill_process(). It may release it and grab a reference on another one of p's threads if !p->mm, but it also guarantees that it will release the reference before returning. [hughd@google.com: fix duplicate put_task_struct()] Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6b0c81b3be114a93f79bd4c5639ade5107d77c21;"Since exiting tasks require write_lock_irq(&tasklist_lock) several times, try to reduce the amount of time the readside is held for oom kills";yes;yes;yes 212;MDY6Q29tbWl0MjMyNTI5ODo2YjBjODFiM2JlMTE0YTkzZjc5YmQ0YzU2MzlhZGU1MTA3ZDc3YzIx;David Rientjes;Linus Torvalds;"mm, oom: reduce dependency on tasklist_lock Since exiting tasks require write_lock_irq(&tasklist_lock) several times, try to reduce the amount of time the readside is held for oom kills. This makes the interface with the memcg oom handler more consistent since it now never needs to take tasklist_lock unnecessarily. The only time the oom killer now takes tasklist_lock is when iterating the children of the selected task, everything else is protected by rcu_read_lock(). This requires that a reference to the selected process, p, is grabbed before calling oom_kill_process(). It may release it and grab a reference on another one of p's threads if !p->mm, but it also guarantees that it will release the reference before returning. [hughd@google.com: fix duplicate put_task_struct()] Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6b0c81b3be114a93f79bd4c5639ade5107d77c21;" This makes the interface with the memcg oom handler more consistent since it now never needs to take tasklist_lock unnecessarily";no;yes;yes 212;MDY6Q29tbWl0MjMyNTI5ODo2YjBjODFiM2JlMTE0YTkzZjc5YmQ0YzU2MzlhZGU1MTA3ZDc3YzIx;David Rientjes;Linus Torvalds;"mm, oom: reduce dependency on tasklist_lock Since exiting tasks require write_lock_irq(&tasklist_lock) several times, try to reduce the amount of time the readside is held for oom kills. This makes the interface with the memcg oom handler more consistent since it now never needs to take tasklist_lock unnecessarily. The only time the oom killer now takes tasklist_lock is when iterating the children of the selected task, everything else is protected by rcu_read_lock(). This requires that a reference to the selected process, p, is grabbed before calling oom_kill_process(). It may release it and grab a reference on another one of p's threads if !p->mm, but it also guarantees that it will release the reference before returning. [hughd@google.com: fix duplicate put_task_struct()] Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6b0c81b3be114a93f79bd4c5639ade5107d77c21;"The only time the oom killer now takes tasklist_lock is when iterating the children of the selected task, everything else is protected by rcu_read_lock()";yes;no;yes 212;MDY6Q29tbWl0MjMyNTI5ODo2YjBjODFiM2JlMTE0YTkzZjc5YmQ0YzU2MzlhZGU1MTA3ZDc3YzIx;David Rientjes;Linus Torvalds;"mm, oom: reduce dependency on tasklist_lock Since exiting tasks require write_lock_irq(&tasklist_lock) several times, try to reduce the amount of time the readside is held for oom kills. This makes the interface with the memcg oom handler more consistent since it now never needs to take tasklist_lock unnecessarily. The only time the oom killer now takes tasklist_lock is when iterating the children of the selected task, everything else is protected by rcu_read_lock(). This requires that a reference to the selected process, p, is grabbed before calling oom_kill_process(). It may release it and grab a reference on another one of p's threads if !p->mm, but it also guarantees that it will release the reference before returning. [hughd@google.com: fix duplicate put_task_struct()] Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6b0c81b3be114a93f79bd4c5639ade5107d77c21;"This requires that a reference to the selected process, p, is grabbed before calling oom_kill_process()";yes;no;yes 212;MDY6Q29tbWl0MjMyNTI5ODo2YjBjODFiM2JlMTE0YTkzZjc5YmQ0YzU2MzlhZGU1MTA3ZDc3YzIx;David Rientjes;Linus Torvalds;"mm, oom: reduce dependency on tasklist_lock Since exiting tasks require write_lock_irq(&tasklist_lock) several times, try to reduce the amount of time the readside is held for oom kills. This makes the interface with the memcg oom handler more consistent since it now never needs to take tasklist_lock unnecessarily. The only time the oom killer now takes tasklist_lock is when iterating the children of the selected task, everything else is protected by rcu_read_lock(). This requires that a reference to the selected process, p, is grabbed before calling oom_kill_process(). It may release it and grab a reference on another one of p's threads if !p->mm, but it also guarantees that it will release the reference before returning. [hughd@google.com: fix duplicate put_task_struct()] Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6b0c81b3be114a93f79bd4c5639ade5107d77c21;" It may release it and grab a reference on another one of p's threads if !p->mm, but it also guarantees that it will release the reference before returning.";yes;no;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;mm, memcg: introduce own oom handler to iterate only over its own threads;yes;no;no 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;"The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator";no;no;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;" Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes";no;yes;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;"Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel";no;no;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;"This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration";no;yes;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;" If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting";no;yes;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;"Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation";no;yes;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;"The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed";no;yes;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;"This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time";yes;yes;no 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;" Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit";yes;no;no 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;"This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim";yes;no;yes 213;MDY6Q29tbWl0MjMyNTI5ODo5Y2JiNzhiYjMxNDM2MGE4NjBhOGIyMzcyMzk3MWNiNmZjYjU0MTc2;David Rientjes;Linus Torvalds;"mm, memcg: introduce own oom handler to iterate only over its own threads The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a seperate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9cbb78bb314360a860a8b23723971cb6fcb54176;" So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held.";yes;yes;no 214;MDY6Q29tbWl0MjMyNTI5ODo0NjI2MDdlY2M1MTliMTk3ZjdiNWNjNmIwMjRhMWMyNmZhNmZjMGFj;David Rientjes;Linus Torvalds;"mm, oom: introduce helper function to process threads during scan This patch introduces a helper function to process each thread during the iteration over the tasklist. A new return type, enum oom_scan_t, is defined to determine the future behavior of the iteration: - OOM_SCAN_OK: continue scanning the thread and find its badness, - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's ineligible, - OOM_SCAN_ABORT: abort the iteration and return, or - OOM_SCAN_SELECT: always select this thread with the highest badness possible. There is no functional change with this patch. This new helper function will be used in the next patch in the memory controller. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/462607ecc519b197f7b5cc6b024a1c26fa6fc0ac;mm, oom: introduce helper function to process threads during scan;yes;yes;no 214;MDY6Q29tbWl0MjMyNTI5ODo0NjI2MDdlY2M1MTliMTk3ZjdiNWNjNmIwMjRhMWMyNmZhNmZjMGFj;David Rientjes;Linus Torvalds;"mm, oom: introduce helper function to process threads during scan This patch introduces a helper function to process each thread during the iteration over the tasklist. A new return type, enum oom_scan_t, is defined to determine the future behavior of the iteration: - OOM_SCAN_OK: continue scanning the thread and find its badness, - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's ineligible, - OOM_SCAN_ABORT: abort the iteration and return, or - OOM_SCAN_SELECT: always select this thread with the highest badness possible. There is no functional change with this patch. This new helper function will be used in the next patch in the memory controller. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/462607ecc519b197f7b5cc6b024a1c26fa6fc0ac;"This patch introduces a helper function to process each thread during the iteration over the tasklist";yes;yes;no 214;MDY6Q29tbWl0MjMyNTI5ODo0NjI2MDdlY2M1MTliMTk3ZjdiNWNjNmIwMjRhMWMyNmZhNmZjMGFj;David Rientjes;Linus Torvalds;"mm, oom: introduce helper function to process threads during scan This patch introduces a helper function to process each thread during the iteration over the tasklist. A new return type, enum oom_scan_t, is defined to determine the future behavior of the iteration: - OOM_SCAN_OK: continue scanning the thread and find its badness, - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's ineligible, - OOM_SCAN_ABORT: abort the iteration and return, or - OOM_SCAN_SELECT: always select this thread with the highest badness possible. There is no functional change with this patch. This new helper function will be used in the next patch in the memory controller. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/462607ecc519b197f7b5cc6b024a1c26fa6fc0ac;" A new return type, enum oom_scan_t, is defined to determine the future behavior of the iteration";yes;yes;no 214;MDY6Q29tbWl0MjMyNTI5ODo0NjI2MDdlY2M1MTliMTk3ZjdiNWNjNmIwMjRhMWMyNmZhNmZjMGFj;David Rientjes;Linus Torvalds;"mm, oom: introduce helper function to process threads during scan This patch introduces a helper function to process each thread during the iteration over the tasklist. A new return type, enum oom_scan_t, is defined to determine the future behavior of the iteration: - OOM_SCAN_OK: continue scanning the thread and find its badness, - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's ineligible, - OOM_SCAN_ABORT: abort the iteration and return, or - OOM_SCAN_SELECT: always select this thread with the highest badness possible. There is no functional change with this patch. This new helper function will be used in the next patch in the memory controller. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/462607ecc519b197f7b5cc6b024a1c26fa6fc0ac;" - OOM_SCAN_OK: continue scanning the thread and find its badness, - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's ineligible, - OOM_SCAN_ABORT: abort the iteration and return, or - OOM_SCAN_SELECT: always select this thread with the highest badness possible";yes;no;yes 214;MDY6Q29tbWl0MjMyNTI5ODo0NjI2MDdlY2M1MTliMTk3ZjdiNWNjNmIwMjRhMWMyNmZhNmZjMGFj;David Rientjes;Linus Torvalds;"mm, oom: introduce helper function to process threads during scan This patch introduces a helper function to process each thread during the iteration over the tasklist. A new return type, enum oom_scan_t, is defined to determine the future behavior of the iteration: - OOM_SCAN_OK: continue scanning the thread and find its badness, - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's ineligible, - OOM_SCAN_ABORT: abort the iteration and return, or - OOM_SCAN_SELECT: always select this thread with the highest badness possible. There is no functional change with this patch. This new helper function will be used in the next patch in the memory controller. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/462607ecc519b197f7b5cc6b024a1c26fa6fc0ac;There is no functional change with this patch;yes;no;no 214;MDY6Q29tbWl0MjMyNTI5ODo0NjI2MDdlY2M1MTliMTk3ZjdiNWNjNmIwMjRhMWMyNmZhNmZjMGFj;David Rientjes;Linus Torvalds;"mm, oom: introduce helper function to process threads during scan This patch introduces a helper function to process each thread during the iteration over the tasklist. A new return type, enum oom_scan_t, is defined to determine the future behavior of the iteration: - OOM_SCAN_OK: continue scanning the thread and find its badness, - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's ineligible, - OOM_SCAN_ABORT: abort the iteration and return, or - OOM_SCAN_SELECT: always select this thread with the highest badness possible. There is no functional change with this patch. This new helper function will be used in the next patch in the memory controller. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/462607ecc519b197f7b5cc6b024a1c26fa6fc0ac;" This new helper function will be used in the next patch in the memory controller.";yes;yes;no 215;MDY6Q29tbWl0MjMyNTI5ODpjMjU1YTQ1ODA1NWU0NTlmNjVlYjdiN2Y1MWRjNWRiZGQwY2FmMWQ4;Andrew Morton;Linus Torvalds;"memcg: rename config variables Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM [mhocko@suse.cz: fix missed bits] Cc: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c255a458055e459f65eb7b7f51dc5dbdd0caf1d8;memcg: rename config variables;yes;no;no 215;MDY6Q29tbWl0MjMyNTI5ODpjMjU1YTQ1ODA1NWU0NTlmNjVlYjdiN2Y1MWRjNWRiZGQwY2FmMWQ4;Andrew Morton;Linus Torvalds;"memcg: rename config variables Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM [mhocko@suse.cz: fix missed bits] Cc: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c255a458055e459f65eb7b7f51dc5dbdd0caf1d8;Sanity;no;yes;no 217;MDY6Q29tbWl0MjMyNTI5ODoxMjFkMWJhMGEwMTllMTQ2NWE1MzUzM2FlYTEzM2IxYjBmNmI0NDJk;David Rientjes;Linus Torvalds;"mm, oom: fix potential killing of thread that is disabled from oom killing /proc/sys/vm/oom_kill_allocating_task will immediately kill current when the oom killer is called to avoid a potentially expensive tasklist scan for large systems. Currently, however, it is not checking current's oom_score_adj value which may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom killing. This patch avoids killing current in such a condition and simply falls back to the tasklist scan since memory still needs to be freed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/121d1ba0a019e1465a53533aea133b1b0f6b442d;mm, oom: fix potential killing of thread that is disabled from oom killing;yes;yes;no 217;MDY6Q29tbWl0MjMyNTI5ODoxMjFkMWJhMGEwMTllMTQ2NWE1MzUzM2FlYTEzM2IxYjBmNmI0NDJk;David Rientjes;Linus Torvalds;"mm, oom: fix potential killing of thread that is disabled from oom killing /proc/sys/vm/oom_kill_allocating_task will immediately kill current when the oom killer is called to avoid a potentially expensive tasklist scan for large systems. Currently, however, it is not checking current's oom_score_adj value which may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom killing. This patch avoids killing current in such a condition and simply falls back to the tasklist scan since memory still needs to be freed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/121d1ba0a019e1465a53533aea133b1b0f6b442d;"/proc/sys/vm/oom_kill_allocating_task will immediately kill current when the oom killer is called to avoid a potentially expensive tasklist scan for large systems";no;no;yes 217;MDY6Q29tbWl0MjMyNTI5ODoxMjFkMWJhMGEwMTllMTQ2NWE1MzUzM2FlYTEzM2IxYjBmNmI0NDJk;David Rientjes;Linus Torvalds;"mm, oom: fix potential killing of thread that is disabled from oom killing /proc/sys/vm/oom_kill_allocating_task will immediately kill current when the oom killer is called to avoid a potentially expensive tasklist scan for large systems. Currently, however, it is not checking current's oom_score_adj value which may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom killing. This patch avoids killing current in such a condition and simply falls back to the tasklist scan since memory still needs to be freed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/121d1ba0a019e1465a53533aea133b1b0f6b442d;"Currently, however, it is not checking current's oom_score_adj value which may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom killing";no;yes;yes 217;MDY6Q29tbWl0MjMyNTI5ODoxMjFkMWJhMGEwMTllMTQ2NWE1MzUzM2FlYTEzM2IxYjBmNmI0NDJk;David Rientjes;Linus Torvalds;"mm, oom: fix potential killing of thread that is disabled from oom killing /proc/sys/vm/oom_kill_allocating_task will immediately kill current when the oom killer is called to avoid a potentially expensive tasklist scan for large systems. Currently, however, it is not checking current's oom_score_adj value which may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom killing. This patch avoids killing current in such a condition and simply falls back to the tasklist scan since memory still needs to be freed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/121d1ba0a019e1465a53533aea133b1b0f6b442d;"This patch avoids killing current in such a condition and simply falls back to the tasklist scan since memory still needs to be freed.";yes;yes;no 219;MDY6Q29tbWl0MjMyNTI5ODpkYWQ3NTU3ZWI3MDU2ODgwNDBhYWMxMzRlZmE1NDE4YjY2ZDVlZDky;Wanpeng Li;Linus Torvalds;"mm: fix kernel-doc warnings Fix kernel-doc warnings such as Warning(../mm/page_cgroup.c:432): No description found for parameter 'id' Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record' Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dad7557eb705688040aac134efa5418b66d5ed92;mm: fix kernel-doc warnings;yes;yes;no 219;MDY6Q29tbWl0MjMyNTI5ODpkYWQ3NTU3ZWI3MDU2ODgwNDBhYWMxMzRlZmE1NDE4YjY2ZDVlZDky;Wanpeng Li;Linus Torvalds;"mm: fix kernel-doc warnings Fix kernel-doc warnings such as Warning(../mm/page_cgroup.c:432): No description found for parameter 'id' Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record' Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dad7557eb705688040aac134efa5418b66d5ed92;"Fix kernel-doc warnings such as Warning(../mm/page_cgroup.c:432): No description found for parameter 'id' Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'";yes;yes;no 221;MDY6Q29tbWl0MjMyNTI5ODoxZTExYWQ4ZGM0Mjk3NWQ1YzJiYWI3ZDQ3OGY2Y2Q4NzU2MDJlZGE0;David Rientjes;Linus Torvalds;"mm, oom: fix badness score underflow If the privileges given to root threads (3% of allowable memory) or a negative value of /proc/pid/oom_score_adj happen to exceed the amount of rss of a thread, its badness score overflows as a result of commit a7f638f999ff (""mm, oom: normalize oom scores to oom_score_adj scale only for userspace""). Fix this by making the type signed and return 1, meaning the thread is still eligible for kill, if the value is negative. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e11ad8dc42975d5c2bab7d478f6cd875602eda4;mm, oom: fix badness score underflow;yes;yes;no 221;MDY6Q29tbWl0MjMyNTI5ODoxZTExYWQ4ZGM0Mjk3NWQ1YzJiYWI3ZDQ3OGY2Y2Q4NzU2MDJlZGE0;David Rientjes;Linus Torvalds;"mm, oom: fix badness score underflow If the privileges given to root threads (3% of allowable memory) or a negative value of /proc/pid/oom_score_adj happen to exceed the amount of rss of a thread, its badness score overflows as a result of commit a7f638f999ff (""mm, oom: normalize oom scores to oom_score_adj scale only for userspace""). Fix this by making the type signed and return 1, meaning the thread is still eligible for kill, if the value is negative. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e11ad8dc42975d5c2bab7d478f6cd875602eda4;"If the privileges given to root threads (3% of allowable memory) or a negative value of /proc/pid/oom_score_adj happen to exceed the amount of rss of a thread, its badness score overflows as a result of commit a7f638f999ff (""mm, oom: normalize oom scores to oom_score_adj scale only for userspace"")";no;yes;yes 221;MDY6Q29tbWl0MjMyNTI5ODoxZTExYWQ4ZGM0Mjk3NWQ1YzJiYWI3ZDQ3OGY2Y2Q4NzU2MDJlZGE0;David Rientjes;Linus Torvalds;"mm, oom: fix badness score underflow If the privileges given to root threads (3% of allowable memory) or a negative value of /proc/pid/oom_score_adj happen to exceed the amount of rss of a thread, its badness score overflows as a result of commit a7f638f999ff (""mm, oom: normalize oom scores to oom_score_adj scale only for userspace""). Fix this by making the type signed and return 1, meaning the thread is still eligible for kill, if the value is negative. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e11ad8dc42975d5c2bab7d478f6cd875602eda4;"Fix this by making the type signed and return 1, meaning the thread is still eligible for kill, if the value is negative.";yes;yes;no 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;mm, oom: normalize oom scores to oom_score_adj scale only for userspace;yes;no;no 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;"The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time";no;no;yes 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;" This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage";no;no;yes 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;"The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill";no;no;yes 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;" Thus, it can only differentiate each process's memory usage by 0.1% of system RAM";no;yes;yes 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;"On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example";no;yes;yes 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;"This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace";yes;no;no 222;MDY6Q29tbWl0MjMyNTI5ODphN2Y2MzhmOTk5ZmY0MjMxMGU5NTgyMjczYjFmZTI1ZWE2ZTQ2OWJh;David Rientjes;Linus Torvalds;"mm, oom: normalize oom scores to oom_score_adj scale only for userspace The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the ""best"" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a7f638f999ff42310e9582273b1fe25ea6e469ba;" This results in better comparison between eligible threads for kill and no change from the userspace perspective.";yes;yes;no 223;MDY6Q29tbWl0MjMyNTI5ODowNzhkZTVmNzA2ZWNlMzZhZmQ3M2JiNGI4MjgzMzE0MTMyZDJkZmRm;Eric W. Biederman;Eric W. Biederman;"userns: Store uid and gid values in struct cred with kuid_t and kgid_t types cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/078de5f706ece36afd73bb4b8283314132d2dfdf;userns: Store uid and gid values in struct cred with kuid_t and kgid_t types;yes;no;no 223;MDY6Q29tbWl0MjMyNTI5ODowNzhkZTVmNzA2ZWNlMzZhZmQ3M2JiNGI4MjgzMzE0MTMyZDJkZmRm;Eric W. Biederman;Eric W. Biederman;"userns: Store uid and gid values in struct cred with kuid_t and kgid_t types cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/078de5f706ece36afd73bb4b8283314132d2dfdf;cred.h and a few trivial users of struct cred are changed;yes;no;no 223;MDY6Q29tbWl0MjMyNTI5ODowNzhkZTVmNzA2ZWNlMzZhZmQ3M2JiNGI4MjgzMzE0MTMyZDJkZmRm;Eric W. Biederman;Eric W. Biederman;"userns: Store uid and gid values in struct cred with kuid_t and kgid_t types cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/078de5f706ece36afd73bb4b8283314132d2dfdf;" The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable";yes;yes;no 223;MDY6Q29tbWl0MjMyNTI5ODowNzhkZTVmNzA2ZWNlMzZhZmQ3M2JiNGI4MjgzMzE0MTMyZDJkZmRm;Eric W. Biederman;Eric W. Biederman;"userns: Store uid and gid values in struct cred with kuid_t and kgid_t types cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>";https://api.github.com/repos/torvalds/linux/git/commits/078de5f706ece36afd73bb4b8283314132d2dfdf;" If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly.";yes;no;yes 224;MDY6Q29tbWl0MjMyNTI5ODpkMmQzOTMwOTlkZTIxZWRhOTFjNWVjNmEwNWQ2MGU1ZGVlNGQ1MTc1;Oleg Nesterov;Linus Torvalds;"signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead of force_sig(SIGKILL). With the recent changes we do not need force_ to kill the CLONE_NEWPID tasks. And this is more correct. force_sig() can race with the exiting thread even if oom_kill_task() checks p->mm != NULL, while do_send_sig_info(group => true) kille the whole process. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d2d393099de21eda91c5ec6a05d60e5dee4d5175;signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig();yes;no;no 224;MDY6Q29tbWl0MjMyNTI5ODpkMmQzOTMwOTlkZTIxZWRhOTFjNWVjNmEwNWQ2MGU1ZGVlNGQ1MTc1;Oleg Nesterov;Linus Torvalds;"signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead of force_sig(SIGKILL). With the recent changes we do not need force_ to kill the CLONE_NEWPID tasks. And this is more correct. force_sig() can race with the exiting thread even if oom_kill_task() checks p->mm != NULL, while do_send_sig_info(group => true) kille the whole process. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d2d393099de21eda91c5ec6a05d60e5dee4d5175;"Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead of force_sig(SIGKILL)";yes;no;no 224;MDY6Q29tbWl0MjMyNTI5ODpkMmQzOTMwOTlkZTIxZWRhOTFjNWVjNmEwNWQ2MGU1ZGVlNGQ1MTc1;Oleg Nesterov;Linus Torvalds;"signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead of force_sig(SIGKILL). With the recent changes we do not need force_ to kill the CLONE_NEWPID tasks. And this is more correct. force_sig() can race with the exiting thread even if oom_kill_task() checks p->mm != NULL, while do_send_sig_info(group => true) kille the whole process. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d2d393099de21eda91c5ec6a05d60e5dee4d5175;" With the recent changes we do not need force_ to kill the CLONE_NEWPID tasks";no;yes;yes 224;MDY6Q29tbWl0MjMyNTI5ODpkMmQzOTMwOTlkZTIxZWRhOTFjNWVjNmEwNWQ2MGU1ZGVlNGQ1MTc1;Oleg Nesterov;Linus Torvalds;"signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead of force_sig(SIGKILL). With the recent changes we do not need force_ to kill the CLONE_NEWPID tasks. And this is more correct. force_sig() can race with the exiting thread even if oom_kill_task() checks p->mm != NULL, while do_send_sig_info(group => true) kille the whole process. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d2d393099de21eda91c5ec6a05d60e5dee4d5175;And this is more correct;no;yes;no 224;MDY6Q29tbWl0MjMyNTI5ODpkMmQzOTMwOTlkZTIxZWRhOTFjNWVjNmEwNWQ2MGU1ZGVlNGQ1MTc1;Oleg Nesterov;Linus Torvalds;"signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead of force_sig(SIGKILL). With the recent changes we do not need force_ to kill the CLONE_NEWPID tasks. And this is more correct. force_sig() can race with the exiting thread even if oom_kill_task() checks p->mm != NULL, while do_send_sig_info(group => true) kille the whole process. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d2d393099de21eda91c5ec6a05d60e5dee4d5175;" force_sig() can race with the exiting thread even if oom_kill_task() checks p->mm != NULL, while do_send_sig_info(group => true) kille the whole process.";no;yes;yes 225;MDY6Q29tbWl0MjMyNTI5ODplODQ1ZTE5OTM2MmNjNTcxMmJhMGU3ZWVkYzE0ZWVkNzBlMTQ0MjU4;David Rientjes;Linus Torvalds;"mm, memcg: pass charge order to oom killer The oom killer typically displays the allocation order at the time of oom as a part of its diangostic messages (for global, cpuset, and mempolicy ooms). The memory controller may also pass the charge order to the oom killer so it can emit the same information. This is useful in determining how large the memory allocation is that triggered the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e845e199362cc5712ba0e7eedc14eed70e144258;mm, memcg: pass charge order to oom killer;yes;no;no 225;MDY6Q29tbWl0MjMyNTI5ODplODQ1ZTE5OTM2MmNjNTcxMmJhMGU3ZWVkYzE0ZWVkNzBlMTQ0MjU4;David Rientjes;Linus Torvalds;"mm, memcg: pass charge order to oom killer The oom killer typically displays the allocation order at the time of oom as a part of its diangostic messages (for global, cpuset, and mempolicy ooms). The memory controller may also pass the charge order to the oom killer so it can emit the same information. This is useful in determining how large the memory allocation is that triggered the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e845e199362cc5712ba0e7eedc14eed70e144258;"The oom killer typically displays the allocation order at the time of oom as a part of its diangostic messages (for global, cpuset, and mempolicy ooms)";no;no;yes 225;MDY6Q29tbWl0MjMyNTI5ODplODQ1ZTE5OTM2MmNjNTcxMmJhMGU3ZWVkYzE0ZWVkNzBlMTQ0MjU4;David Rientjes;Linus Torvalds;"mm, memcg: pass charge order to oom killer The oom killer typically displays the allocation order at the time of oom as a part of its diangostic messages (for global, cpuset, and mempolicy ooms). The memory controller may also pass the charge order to the oom killer so it can emit the same information. This is useful in determining how large the memory allocation is that triggered the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e845e199362cc5712ba0e7eedc14eed70e144258;"The memory controller may also pass the charge order to the oom killer so it can emit the same information";yes;no;yes 225;MDY6Q29tbWl0MjMyNTI5ODplODQ1ZTE5OTM2MmNjNTcxMmJhMGU3ZWVkYzE0ZWVkNzBlMTQ0MjU4;David Rientjes;Linus Torvalds;"mm, memcg: pass charge order to oom killer The oom killer typically displays the allocation order at the time of oom as a part of its diangostic messages (for global, cpuset, and mempolicy ooms). The memory controller may also pass the charge order to the oom killer so it can emit the same information. This is useful in determining how large the memory allocation is that triggered the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e845e199362cc5712ba0e7eedc14eed70e144258;" This is useful in determining how large the memory allocation is that triggered the oom killer.";no;yes;no 226;MDY6Q29tbWl0MjMyNTI5ODowOGFiOWIxMGQ0M2FjYTA5MWZkZmY1OGI2OWZjMWVjODljNWI4YTgz;David Rientjes;Linus Torvalds;"mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/08ab9b10d43aca091fdff58b69fc1ec89c5b8a83;mm, oom: force oom kill on sysrq+f;yes;yes;no 226;MDY6Q29tbWl0MjMyNTI5ODowOGFiOWIxMGQ0M2FjYTA5MWZkZmY1OGI2OWZjMWVjODljNWI4YTgz;David Rientjes;Linus Torvalds;"mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/08ab9b10d43aca091fdff58b69fc1ec89c5b8a83;The oom killer chooses not to kill a thread if;no;no;yes 226;MDY6Q29tbWl0MjMyNTI5ODowOGFiOWIxMGQ0M2FjYTA5MWZkZmY1OGI2OWZjMWVjODljNWI4YTgz;David Rientjes;Linus Torvalds;"mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/08ab9b10d43aca091fdff58b69fc1ec89c5b8a83;" - an eligible thread has already been oom killed and has yet to exit, - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory";no;no;yes 226;MDY6Q29tbWl0MjMyNTI5ODowOGFiOWIxMGQ0M2FjYTA5MWZkZmY1OGI2OWZjMWVjODljNWI4YTgz;David Rientjes;Linus Torvalds;"mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/08ab9b10d43aca091fdff58b69fc1ec89c5b8a83;"SysRq+F manually invokes the global oom killer to kill a memory-hogging task";no;no;yes 226;MDY6Q29tbWl0MjMyNTI5ODowOGFiOWIxMGQ0M2FjYTA5MWZkZmY1OGI2OWZjMWVjODljNWI4YTgz;David Rientjes;Linus Torvalds;"mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/08ab9b10d43aca091fdff58b69fc1ec89c5b8a83;" This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself";no;no;yes 226;MDY6Q29tbWl0MjMyNTI5ODowOGFiOWIxMGQ0M2FjYTA5MWZkZmY1OGI2OWZjMWVjODljNWI4YTgz;David Rientjes;Linus Torvalds;"mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/08ab9b10d43aca091fdff58b69fc1ec89c5b8a83;For both uses, we always want to kill a thread and never defer;no;yes;yes 226;MDY6Q29tbWl0MjMyNTI5ODowOGFiOWIxMGQ0M2FjYTA5MWZkZmY1OGI2OWZjMWVjODljNWI4YTgz;David Rientjes;Linus Torvalds;"mm, oom: force oom kill on sysrq+f The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/08ab9b10d43aca091fdff58b69fc1ec89c5b8a83;" This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit.";yes;yes;no 227;MDY6Q29tbWl0MjMyNTI5ODpkYzNmMjFlYWRlZWE2ZDk4OTgyNzFmZjMyZDM1ZDVlMDBjNjg3MmVh;David Rientjes;Linus Torvalds;"mm, oom: introduce independent oom killer ratelimit state printk_ratelimit() uses the global ratelimit state for all printks. The oom killer should not be subjected to this state just because another subsystem or driver may be flooding the kernel log. This patch introduces printk ratelimiting specifically for the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc3f21eadeea6d9898271ff32d35d5e00c6872ea;mm, oom: introduce independent oom killer ratelimit state;yes;no;no 227;MDY6Q29tbWl0MjMyNTI5ODpkYzNmMjFlYWRlZWE2ZDk4OTgyNzFmZjMyZDM1ZDVlMDBjNjg3MmVh;David Rientjes;Linus Torvalds;"mm, oom: introduce independent oom killer ratelimit state printk_ratelimit() uses the global ratelimit state for all printks. The oom killer should not be subjected to this state just because another subsystem or driver may be flooding the kernel log. This patch introduces printk ratelimiting specifically for the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc3f21eadeea6d9898271ff32d35d5e00c6872ea;printk_ratelimit() uses the global ratelimit state for all printks;no;no;yes 227;MDY6Q29tbWl0MjMyNTI5ODpkYzNmMjFlYWRlZWE2ZDk4OTgyNzFmZjMyZDM1ZDVlMDBjNjg3MmVh;David Rientjes;Linus Torvalds;"mm, oom: introduce independent oom killer ratelimit state printk_ratelimit() uses the global ratelimit state for all printks. The oom killer should not be subjected to this state just because another subsystem or driver may be flooding the kernel log. This patch introduces printk ratelimiting specifically for the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc3f21eadeea6d9898271ff32d35d5e00c6872ea;" The oom killer should not be subjected to this state just because another subsystem or driver may be flooding the kernel log";no;yes;no 227;MDY6Q29tbWl0MjMyNTI5ODpkYzNmMjFlYWRlZWE2ZDk4OTgyNzFmZjMyZDM1ZDVlMDBjNjg3MmVh;David Rientjes;Linus Torvalds;"mm, oom: introduce independent oom killer ratelimit state printk_ratelimit() uses the global ratelimit state for all printks. The oom killer should not be subjected to this state just because another subsystem or driver may be flooding the kernel log. This patch introduces printk ratelimiting specifically for the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc3f21eadeea6d9898271ff32d35d5e00c6872ea;This patch introduces printk ratelimiting specifically for the oom killer.;yes;no;no 228;MDY6Q29tbWl0MjMyNTI5ODo4NDQ3ZDk1MGU3NDQ1Y2FlNzFhZDY2ZDBlMzM3ODRmODM4OGFhZjlk;David Rientjes;Linus Torvalds;"mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8447d950e7445cae71ad66d0e33784f8388aaf9d;mm, oom: do not emit oom killer warning if chosen thread is already exiting;yes;no;no 228;MDY6Q29tbWl0MjMyNTI5ODo4NDQ3ZDk1MGU3NDQ1Y2FlNzFhZDY2ZDBlMzM3ODRmODM4OGFhZjlk;David Rientjes;Linus Torvalds;"mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8447d950e7445cae71ad66d0e33784f8388aaf9d;"If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns";no;no;yes 228;MDY6Q29tbWl0MjMyNTI5ODo4NDQ3ZDk1MGU3NDQ1Y2FlNzFhZDY2ZDBlMzM3ODRmODM4OGFhZjlk;David Rientjes;Linus Torvalds;"mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8447d950e7445cae71ad66d0e33784f8388aaf9d;" This allows the thread to have access to memory reserves so that it may quickly exit";no;no;yes 228;MDY6Q29tbWl0MjMyNTI5ODo4NDQ3ZDk1MGU3NDQ1Y2FlNzFhZDY2ZDBlMzM3ODRmODM4OGFhZjlk;David Rientjes;Linus Torvalds;"mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8447d950e7445cae71ad66d0e33784f8388aaf9d;" This logic is preceeded with a comment saying there's no need to alarm the sysadmin";no;no;yes 228;MDY6Q29tbWl0MjMyNTI5ODo4NDQ3ZDk1MGU3NDQ1Y2FlNzFhZDY2ZDBlMzM3ODRmODM4OGFhZjlk;David Rientjes;Linus Torvalds;"mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8447d950e7445cae71ad66d0e33784f8388aaf9d;This patch adds truth to that statement;yes;yes;no 228;MDY6Q29tbWl0MjMyNTI5ODo4NDQ3ZDk1MGU3NDQ1Y2FlNzFhZDY2ZDBlMzM3ODRmODM4OGFhZjlk;David Rientjes;Linus Torvalds;"mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8447d950e7445cae71ad66d0e33784f8388aaf9d;"There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed";yes;yes;yes 228;MDY6Q29tbWl0MjMyNTI5ODo4NDQ3ZDk1MGU3NDQ1Y2FlNzFhZDY2ZDBlMzM3ODRmODM4OGFhZjlk;David Rientjes;Linus Torvalds;"mm, oom: do not emit oom killer warning if chosen thread is already exiting If a thread is chosen for oom kill and is already PF_EXITING, then the oom killer simply sets TIF_MEMDIE and returns. This allows the thread to have access to memory reserves so that it may quickly exit. This logic is preceeded with a comment saying there's no need to alarm the sysadmin. This patch adds truth to that statement. There's no need to emit any warning about the oom condition if the thread is already exiting since it will not be killed. In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8447d950e7445cae71ad66d0e33784f8388aaf9d;" In this condition, just silently return the oom killer since its only giving access to memory reserves and is otherwise a no-op.";yes;yes;yes 229;MDY6Q29tbWl0MjMyNTI5ODo2NDdmMmJkZjRhMDBkYmNhYTg5NjQyODY1MDFkNjhlN2QyZTZkYTkz;David Rientjes;Linus Torvalds;"mm, oom: fold oom_kill_task() into oom_kill_process() oom_kill_task() has a single caller, so fold it into its parent function, oom_kill_process(). Slightly reduces the number of lines in the oom killer. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/647f2bdf4a00dbcaa8964286501d68e7d2e6da93;mm, oom: fold oom_kill_task() into oom_kill_process();yes;no;no 229;MDY6Q29tbWl0MjMyNTI5ODo2NDdmMmJkZjRhMDBkYmNhYTg5NjQyODY1MDFkNjhlN2QyZTZkYTkz;David Rientjes;Linus Torvalds;"mm, oom: fold oom_kill_task() into oom_kill_process() oom_kill_task() has a single caller, so fold it into its parent function, oom_kill_process(). Slightly reduces the number of lines in the oom killer. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/647f2bdf4a00dbcaa8964286501d68e7d2e6da93;"oom_kill_task() has a single caller, so fold it into its parent function, oom_kill_process()";yes;yes;yes 229;MDY6Q29tbWl0MjMyNTI5ODo2NDdmMmJkZjRhMDBkYmNhYTg5NjQyODY1MDFkNjhlN2QyZTZkYTkz;David Rientjes;Linus Torvalds;"mm, oom: fold oom_kill_task() into oom_kill_process() oom_kill_task() has a single caller, so fold it into its parent function, oom_kill_process(). Slightly reduces the number of lines in the oom killer. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/647f2bdf4a00dbcaa8964286501d68e7d2e6da93;" Slightly reduces the number of lines in the oom killer.";no;yes;no 230;MDY6Q29tbWl0MjMyNTI5ODoyYTFjOWIxZmMwYTBlYTJlMzBjZGViNjkwNjI2NDdjNWM1YWU2NjFm;David Rientjes;Linus Torvalds;"mm, oom: avoid looping when chosen thread detaches its mm oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm. In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist. It's unnecessary to loop in the oom killer and find another thread to kill. This allows both oom_kill_task() and oom_kill_process() to be converted to void functions. If the oom condition persists, the oom killer will be recalled. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2a1c9b1fc0a0ea2e30cdeb69062647c5c5ae661f;mm, oom: avoid looping when chosen thread detaches its mm;yes;yes;no 230;MDY6Q29tbWl0MjMyNTI5ODoyYTFjOWIxZmMwYTBlYTJlMzBjZGViNjkwNjI2NDdjNWM1YWU2NjFm;David Rientjes;Linus Torvalds;"mm, oom: avoid looping when chosen thread detaches its mm oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm. In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist. It's unnecessary to loop in the oom killer and find another thread to kill. This allows both oom_kill_task() and oom_kill_process() to be converted to void functions. If the oom condition persists, the oom killer will be recalled. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2a1c9b1fc0a0ea2e30cdeb69062647c5c5ae661f;"oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm";no;no;yes 230;MDY6Q29tbWl0MjMyNTI5ODoyYTFjOWIxZmMwYTBlYTJlMzBjZGViNjkwNjI2NDdjNWM1YWU2NjFm;David Rientjes;Linus Torvalds;"mm, oom: avoid looping when chosen thread detaches its mm oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm. In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist. It's unnecessary to loop in the oom killer and find another thread to kill. This allows both oom_kill_task() and oom_kill_process() to be converted to void functions. If the oom condition persists, the oom killer will be recalled. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2a1c9b1fc0a0ea2e30cdeb69062647c5c5ae661f;"In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist";yes;yes;yes 230;MDY6Q29tbWl0MjMyNTI5ODoyYTFjOWIxZmMwYTBlYTJlMzBjZGViNjkwNjI2NDdjNWM1YWU2NjFm;David Rientjes;Linus Torvalds;"mm, oom: avoid looping when chosen thread detaches its mm oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm. In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist. It's unnecessary to loop in the oom killer and find another thread to kill. This allows both oom_kill_task() and oom_kill_process() to be converted to void functions. If the oom condition persists, the oom killer will be recalled. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2a1c9b1fc0a0ea2e30cdeb69062647c5c5ae661f;" It's unnecessary to loop in the oom killer and find another thread to kill";no;yes;no 230;MDY6Q29tbWl0MjMyNTI5ODoyYTFjOWIxZmMwYTBlYTJlMzBjZGViNjkwNjI2NDdjNWM1YWU2NjFm;David Rientjes;Linus Torvalds;"mm, oom: avoid looping when chosen thread detaches its mm oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm. In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist. It's unnecessary to loop in the oom killer and find another thread to kill. This allows both oom_kill_task() and oom_kill_process() to be converted to void functions. If the oom condition persists, the oom killer will be recalled. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2a1c9b1fc0a0ea2e30cdeb69062647c5c5ae661f;"This allows both oom_kill_task() and oom_kill_process() to be converted to void functions";yes;no;yes 230;MDY6Q29tbWl0MjMyNTI5ODoyYTFjOWIxZmMwYTBlYTJlMzBjZGViNjkwNjI2NDdjNWM1YWU2NjFm;David Rientjes;Linus Torvalds;"mm, oom: avoid looping when chosen thread detaches its mm oom_kill_task() returns non-zero iff the chosen process does not have any threads with an attached ->mm. In such a case, it's better to just return to the page allocator and retry the allocation because memory could have been freed in the interim and the oom condition may no longer exist. It's unnecessary to loop in the oom killer and find another thread to kill. This allows both oom_kill_task() and oom_kill_process() to be converted to void functions. If the oom condition persists, the oom killer will be recalled. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2a1c9b1fc0a0ea2e30cdeb69062647c5c5ae661f;" If the oom condition persists, the oom killer will be recalled.";yes;no;yes 231;MDY6Q29tbWl0MjMyNTI5ODo3MjgzNWM4NmNhMTVkMDEyNjM1NGI3M2Q1ZjI5Y2U5MTk0OTMxYzli;Johannes Weiner;Linus Torvalds;"mm: unify remaining mem_cont, mem, etc. variable names to memcg Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72835c86ca15d0126354b73d5f29ce9194931c9b;mm: unify remaining mem_cont, mem, etc. variable names to memcg;yes;yes;no 232;MDY6Q29tbWl0MjMyNTI5ODplYzBmZmZkODRiMTYyZTA1NjNhMjhhODFhYTA0OWY5NDZiMzFhOGUy;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove memcg argument from oom_kill_task() The memcg argument of oom_kill_task() hasn't been used since 341aea2 'oom-kill: remove boost_dying_task_prio()'. Kill it. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec0fffd84b162e0563a28a81aa049f946b31a8e2;mm: oom_kill: remove memcg argument from oom_kill_task();yes;no;no 232;MDY6Q29tbWl0MjMyNTI5ODplYzBmZmZkODRiMTYyZTA1NjNhMjhhODFhYTA0OWY5NDZiMzFhOGUy;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove memcg argument from oom_kill_task() The memcg argument of oom_kill_task() hasn't been used since 341aea2 'oom-kill: remove boost_dying_task_prio()'. Kill it. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec0fffd84b162e0563a28a81aa049f946b31a8e2;"The memcg argument of oom_kill_task() hasn't been used since 341aea2 'oom-kill: remove boost_dying_task_prio()'";no;yes;yes 232;MDY6Q29tbWl0MjMyNTI5ODplYzBmZmZkODRiMTYyZTA1NjNhMjhhODFhYTA0OWY5NDZiMzFhOGUy;Johannes Weiner;Linus Torvalds;"mm: oom_kill: remove memcg argument from oom_kill_task() The memcg argument of oom_kill_task() hasn't been used since 341aea2 'oom-kill: remove boost_dying_task_prio()'. Kill it. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ec0fffd84b162e0563a28a81aa049f946b31a8e2; Kill it.;yes;no;no 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;tracepoint: add tracepoints for debugging oom_score_adj;yes;yes;no 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;oom_score_adj is used for guarding processes from OOM-Killer;no;no;yes 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;" One of problem is that it's inherited at fork()";no;yes;yes 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;" When a daemon set oom_score_adj and make children, it's hard to know where the value is set";no;yes;yes 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;This patch adds some tracepoints useful for debugging;yes;yes;no 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;"This patch adds 3 trace points";yes;no;no 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;" - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer";yes;no;no 233;MDY6Q29tbWl0MjMyNTI5ODo0M2QyYjExMzI0MWQ2Nzk3Yjg5MDMxODc2N2UwYWY3OGUzMTM0MTRi;KAMEZAWA Hiroyuki;Linus Torvalds;"tracepoint: add tracepoints for debugging oom_score_adj oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo ""oom_score_adj != 0"" > $EVENT/task_newtask/filter # echo ""oom_score_adj != 0"" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43d2b113241d6797b890318767e0af78e313414b;"Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # EVENT=/sys/kernel/debug/tracing/events/oom/ output will be like this.";yes;no;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;oom: fix integer overflow of points in oom_badness;yes;yes;no 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;"An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value";no;yes;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;" This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000)";no;no;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;" So there could be an int overflow while computing and points may have negative value";no;yes;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;"Meaning the oom score for a mem hog task will be one";no;yes;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;For example;no;yes;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;"[ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one";no;yes;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;"In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one";no;yes;yes 234;MDY6Q29tbWl0MjMyNTI5ODpmZjA1YjZmN2FlNzYyYjZlYjQ2NDE4M2VlYzk5NGIyOGVhMDlmNmRk;Frantisek Hrbata;Linus Torvalds;"oom: fix integer overflow of points in oom_badness An integer overflow will happen on 64bit archs if task's sum of rss, swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by commit f755a04 oom: use pte pages in OOM score where the oom score computation was divided into several steps and it's no longer computed as one expression in unsigned long(rss, swapents, nr_pte are unsigned long), where the result value assigned to points(int) is in range(1..1000). So there could be an int overflow while computing 176 points *= 1000; and points may have negative value. Meaning the oom score for a mem hog task will be one. 196 if (points <= 0) 197 return 1; For example: [ 3366] 0 3366 35390480 24303939 5 0 0 oom01 Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical memory, but it's oom score is one. In this situation the mem hog task is skipped and oom killer kills another and most probably innocent task with oom score greater than one. The points variable should be of type long instead of int to prevent the int overflow. Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff05b6f7ae762b6eb464183eec994b28ea09f6dd;"The points variable should be of type long instead of int to prevent the int overflow.";yes;yes;no 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c;freezer: rename thaw_process() to __thaw_task() and simplify the implementation;yes;yes;no 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c;"thaw_process() now has only internal users - system and cgroup freezers";no;no;yes 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c;" Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it";yes;yes;no 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c;" This will help further updates to the freezer code";no;yes;no 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c;#NAME?;no;no;yes 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c; Convert it to use __thaw_task() for now;yes;no;no 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c;" In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen";yes;no;no 235;MDY6Q29tbWl0MjMyNTI5ODphNWJlMmQwZDFhODc0NmU3YmU1MjEwZTNkNmI5MDQ0NTUwMDA0NDNj;Tejun Heo;Tejun Heo;"freezer: rename thaw_process() to __thaw_task() and simplify the implementation thaw_process() now has only internal users - system and cgroup freezers. Remove the unnecessary return value, rename, unexport and collapse __thaw_process() into it. This will help further updates to the freezer code. -v3: oom_kill grew a use of thaw_process() while this patch was pending. Convert it to use __thaw_task() for now. In the longer term, this should be handled by allowing tasks to die if killed even if it's frozen. -v2: minor style update as suggested by Matt. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paul Menage <menage@google.com> Cc: Matt Helsley <matthltc@us.ibm.com>";https://api.github.com/repos/torvalds/linux/git/commits/a5be2d0d1a8746e7be5210e3d6b904455000443c;#NAME?;yes;no;no 236;MDY6Q29tbWl0MjMyNTI5ODo1YWVjYzg1YWJkYjlhYzJiMGU2NTQ4ZDEzNjUyYTM0MTQyZTdhZTg5;Michal Hocko;Linus Torvalds;"oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MIN Commit c9f01245 (""oom: remove oom_disable_count"") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled). Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value. This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled. Let's break out early if the task should have OOM disabled. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5aecc85abdb9ac2b0e6548d13652a34142e7ae89;oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MIN;yes;no;no 236;MDY6Q29tbWl0MjMyNTI5ODo1YWVjYzg1YWJkYjlhYzJiMGU2NTQ4ZDEzNjUyYTM0MTQyZTdhZTg5;Michal Hocko;Linus Torvalds;"oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MIN Commit c9f01245 (""oom: remove oom_disable_count"") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled). Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value. This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled. Let's break out early if the task should have OOM disabled. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5aecc85abdb9ac2b0e6548d13652a34142e7ae89;"Commit c9f01245 (""oom: remove oom_disable_count"") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled)";no;no;yes 236;MDY6Q29tbWl0MjMyNTI5ODo1YWVjYzg1YWJkYjlhYzJiMGU2NTQ4ZDEzNjUyYTM0MTQyZTdhZTg5;Michal Hocko;Linus Torvalds;"oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MIN Commit c9f01245 (""oom: remove oom_disable_count"") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled). Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value. This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled. Let's break out early if the task should have OOM disabled. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5aecc85abdb9ac2b0e6548d13652a34142e7ae89;"Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value";no;no;yes 236;MDY6Q29tbWl0MjMyNTI5ODo1YWVjYzg1YWJkYjlhYzJiMGU2NTQ4ZDEzNjUyYTM0MTQyZTdhZTg5;Michal Hocko;Linus Torvalds;"oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MIN Commit c9f01245 (""oom: remove oom_disable_count"") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled). Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value. This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled. Let's break out early if the task should have OOM disabled. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5aecc85abdb9ac2b0e6548d13652a34142e7ae89;" This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp";no;yes;yes 236;MDY6Q29tbWl0MjMyNTI5ODo1YWVjYzg1YWJkYjlhYzJiMGU2NTQ4ZDEzNjUyYTM0MTQyZTdhZTg5;Michal Hocko;Linus Torvalds;"oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MIN Commit c9f01245 (""oom: remove oom_disable_count"") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled). Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value. This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled. Let's break out early if the task should have OOM disabled. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5aecc85abdb9ac2b0e6548d13652a34142e7ae89;" CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled";no;yes;yes 236;MDY6Q29tbWl0MjMyNTI5ODo1YWVjYzg1YWJkYjlhYzJiMGU2NTQ4ZDEzNjUyYTM0MTQyZTdhZTg5;Michal Hocko;Linus Torvalds;"oom: do not kill tasks with oom_score_adj OOM_SCORE_ADJ_MIN Commit c9f01245 (""oom: remove oom_disable_count"") has removed the oom_disable_count counter which has been used for early break out from oom_badness so we could never select a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (oom disabled). Now that the counter is gone we are always going through heuristics calculation and we always return a non zero positive value. This means that we can end up killing a task with OOM disabled because it is indistinguishable from regular tasks with 1% resp. CAP_SYS_ADMIN tasks with 3% usage of memory or tasks with oom_score_adj set but OOM enabled. Let's break out early if the task should have OOM disabled. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5aecc85abdb9ac2b0e6548d13652a34142e7ae89;Let's break out early if the task should have OOM disabled.;yes;no;no 237;MDY6Q29tbWl0MjMyNTI5ODo0MzM2MmE0OTc3ZTM3ZGI0NmY4NmY3ZTZhYjkzNWYwMDA2OTU2NjMy;David Rientjes;Linus Torvalds;"oom: fix race while temporarily setting current's oom_score_adj test_set_oom_score_adj() was introduced in 72788c385604 (""oom: replace PF_OOM_ORIGIN with toggling oom_score_adj"") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43362a4977e37db46f86f7e6ab935f0006956632;oom: fix race while temporarily setting current's oom_score_adj;yes;yes;no 237;MDY6Q29tbWl0MjMyNTI5ODo0MzM2MmE0OTc3ZTM3ZGI0NmY4NmY3ZTZhYjkzNWYwMDA2OTU2NjMy;David Rientjes;Linus Torvalds;"oom: fix race while temporarily setting current's oom_score_adj test_set_oom_score_adj() was introduced in 72788c385604 (""oom: replace PF_OOM_ORIGIN with toggling oom_score_adj"") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43362a4977e37db46f86f7e6ab935f0006956632;"test_set_oom_score_adj() was introduced in 72788c385604 (""oom: replace PF_OOM_ORIGIN with toggling oom_score_adj"") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag";no;no;yes 237;MDY6Q29tbWl0MjMyNTI5ODo0MzM2MmE0OTc3ZTM3ZGI0NmY4NmY3ZTZhYjkzNWYwMDA2OTU2NjMy;David Rientjes;Linus Torvalds;"oom: fix race while temporarily setting current's oom_score_adj test_set_oom_score_adj() was introduced in 72788c385604 (""oom: replace PF_OOM_ORIGIN with toggling oom_score_adj"") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43362a4977e37db46f86f7e6ab935f0006956632;"Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated";no;yes;yes 237;MDY6Q29tbWl0MjMyNTI5ODo0MzM2MmE0OTc3ZTM3ZGI0NmY4NmY3ZTZhYjkzNWYwMDA2OTU2NjMy;David Rientjes;Linus Torvalds;"oom: fix race while temporarily setting current's oom_score_adj test_set_oom_score_adj() was introduced in 72788c385604 (""oom: replace PF_OOM_ORIGIN with toggling oom_score_adj"") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43362a4977e37db46f86f7e6ab935f0006956632;" That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification";no;yes;yes 237;MDY6Q29tbWl0MjMyNTI5ODo0MzM2MmE0OTc3ZTM3ZGI0NmY4NmY3ZTZhYjkzNWYwMDA2OTU2NjMy;David Rientjes;Linus Torvalds;"oom: fix race while temporarily setting current's oom_score_adj test_set_oom_score_adj() was introduced in 72788c385604 (""oom: replace PF_OOM_ORIGIN with toggling oom_score_adj"") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43362a4977e37db46f86f7e6ab935f0006956632;"To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86";yes;yes;no 237;MDY6Q29tbWl0MjMyNTI5ODo0MzM2MmE0OTc3ZTM3ZGI0NmY4NmY3ZTZhYjkzNWYwMDA2OTU2NjMy;David Rientjes;Linus Torvalds;"oom: fix race while temporarily setting current's oom_score_adj test_set_oom_score_adj() was introduced in 72788c385604 (""oom: replace PF_OOM_ORIGIN with toggling oom_score_adj"") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/43362a4977e37db46f86f7e6ab935f0006956632;" It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value.";yes;no;no 238;MDY6Q29tbWl0MjMyNTI5ODpjOWYwMTI0NWI2YTdkNzdkMTdkZWFhNzFhZjEwZjZhY2ExNGZhMjRl;David Rientjes;Linus Torvalds;"oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c9f01245b6a7d77d17deaa71af10f6aca14fa24e;oom: remove oom_disable_count;yes;no;no 238;MDY6Q29tbWl0MjMyNTI5ODpjOWYwMTI0NWI2YTdkNzdkMTdkZWFhNzFhZjEwZjZhY2ExNGZhMjRl;David Rientjes;Linus Torvalds;"oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c9f01245b6a7d77d17deaa71af10f6aca14fa24e;"This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy";yes;yes;no 238;MDY6Q29tbWl0MjMyNTI5ODpjOWYwMTI0NWI2YTdkNzdkMTdkZWFhNzFhZjEwZjZhY2ExNGZhMjRl;David Rientjes;Linus Torvalds;"oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c9f01245b6a7d77d17deaa71af10f6aca14fa24e;" The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow";no;yes;yes 238;MDY6Q29tbWl0MjMyNTI5ODpjOWYwMTI0NWI2YTdkNzdkMTdkZWFhNzFhZjEwZjZhY2ExNGZhMjRl;David Rientjes;Linus Torvalds;"oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c9f01245b6a7d77d17deaa71af10f6aca14fa24e;"The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing";no;no;yes 238;MDY6Q29tbWl0MjMyNTI5ODpjOWYwMTI0NWI2YTdkNzdkMTdkZWFhNzFhZjEwZjZhY2ExNGZhMjRl;David Rientjes;Linus Torvalds;"oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c9f01245b6a7d77d17deaa71af10f6aca14fa24e;" The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since";yes;yes;no 238;MDY6Q29tbWl0MjMyNTI5ODpjOWYwMTI0NWI2YTdkNzdkMTdkZWFhNzFhZjEwZjZhY2ExNGZhMjRl;David Rientjes;Linus Torvalds;"oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c9f01245b6a7d77d17deaa71af10f6aca14fa24e;" - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled.";no;yes;yes 239;MDY6Q29tbWl0MjMyNTI5ODo3YjBkNDRmYTQ5YjFkY2ZkY2Y0ODk3ZjEyZGRkMTJkZGVhYjFhOWQ3;David Rientjes;Linus Torvalds;"oom: avoid killing kthreads if they assume the oom killed thread's mm After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed. A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread. This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b0d44fa49b1dcfdcf4897f12ddd12ddeab1a9d7;oom: avoid killing kthreads if they assume the oom killed thread's mm;yes;no;no 239;MDY6Q29tbWl0MjMyNTI5ODo3YjBkNDRmYTQ5YjFkY2ZkY2Y0ODk3ZjEyZGRkMTJkZGVhYjFhOWQ3;David Rientjes;Linus Torvalds;"oom: avoid killing kthreads if they assume the oom killed thread's mm After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed. A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread. This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b0d44fa49b1dcfdcf4897f12ddd12ddeab1a9d7;"After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups";no;no;yes 239;MDY6Q29tbWl0MjMyNTI5ODo3YjBkNDRmYTQ5YjFkY2ZkY2Y0ODk3ZjEyZGRkMTJkZGVhYjFhOWQ3;David Rientjes;Linus Torvalds;"oom: avoid killing kthreads if they assume the oom killed thread's mm After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed. A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread. This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b0d44fa49b1dcfdcf4897f12ddd12ddeab1a9d7;" It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed";no;no;yes 239;MDY6Q29tbWl0MjMyNTI5ODo3YjBkNDRmYTQ5YjFkY2ZkY2Y0ODk3ZjEyZGRkMTJkZGVhYjFhOWQ3;David Rientjes;Linus Torvalds;"oom: avoid killing kthreads if they assume the oom killed thread's mm After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed. A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread. This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b0d44fa49b1dcfdcf4897f12ddd12ddeab1a9d7;"A kernel thread, however, may assume a user thread's mm by using use_mm()";no;yes;yes 239;MDY6Q29tbWl0MjMyNTI5ODo3YjBkNDRmYTQ5YjFkY2ZkY2Y0ODk3ZjEyZGRkMTJkZGVhYjFhOWQ3;David Rientjes;Linus Torvalds;"oom: avoid killing kthreads if they assume the oom killed thread's mm After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed. A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread. This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b0d44fa49b1dcfdcf4897f12ddd12ddeab1a9d7;" This is only temporary and should not result in sending a SIGKILL to that kthread";no;yes;yes 239;MDY6Q29tbWl0MjMyNTI5ODo3YjBkNDRmYTQ5YjFkY2ZkY2Y0ODk3ZjEyZGRkMTJkZGVhYjFhOWQ3;David Rientjes;Linus Torvalds;"oom: avoid killing kthreads if they assume the oom killed thread's mm After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed. A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread. This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b0d44fa49b1dcfdcf4897f12ddd12ddeab1a9d7;"This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task.";yes;yes;no 241;MDY6Q29tbWl0MjMyNTI5ODpiOTVmMWIzMWI3NTU4ODMwNmUzMmIyYWZkMzIxNjZjYWQ0OGY2NzBi;Paul Gortmaker;Paul Gortmaker;"mm: Map most files to use export.h instead of module.h The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>";https://api.github.com/repos/torvalds/linux/git/commits/b95f1b31b75588306e32b2afd32166cad48f670b;mm: Map most files to use export.h instead of module.h;yes;no;no 241;MDY6Q29tbWl0MjMyNTI5ODpiOTVmMWIzMWI3NTU4ODMwNmUzMmIyYWZkMzIxNjZjYWQ0OGY2NzBi;Paul Gortmaker;Paul Gortmaker;"mm: Map most files to use export.h instead of module.h The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>";https://api.github.com/repos/torvalds/linux/git/commits/b95f1b31b75588306e32b2afd32166cad48f670b;"The files changed within are only using the EXPORT_SYMBOL macro variants";no;no;yes 241;MDY6Q29tbWl0MjMyNTI5ODpiOTVmMWIzMWI3NTU4ODMwNmUzMmIyYWZkMzIxNjZjYWQ0OGY2NzBi;Paul Gortmaker;Paul Gortmaker;"mm: Map most files to use export.h instead of module.h The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>";https://api.github.com/repos/torvalds/linux/git/commits/b95f1b31b75588306e32b2afd32166cad48f670b;" They are not using core modular infrastructure and hence don't need module.h but only the export.h header.";no;yes;yes 242;MDY6Q29tbWl0MjMyNTI5ODpjMDI3YTQ3NGE2ODA2NTM5MWM4NzczZjZlODNlZDU0MTI2NTdlMzY5;Oleg Nesterov;Linus Torvalds;"oom: task->mm == NULL doesn't mean the memory was freed exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which frees the memory. However select_bad_process() checks ->mm != NULL before TIF_MEMDIE, so it continues to kill other tasks even if we have the oom-killed task freeing its memory. Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip the tasks which have already passed exit_notify() to ensure a zombie with TIF_MEMDIE set can't block oom-killer. Alternatively we could probably clear TIF_MEMDIE after exit_mmap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c027a474a68065391c8773f6e83ed5412657e369;oom: task->mm == NULL doesn't mean the memory was freed;no;yes;yes 242;MDY6Q29tbWl0MjMyNTI5ODpjMDI3YTQ3NGE2ODA2NTM5MWM4NzczZjZlODNlZDU0MTI2NTdlMzY5;Oleg Nesterov;Linus Torvalds;"oom: task->mm == NULL doesn't mean the memory was freed exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which frees the memory. However select_bad_process() checks ->mm != NULL before TIF_MEMDIE, so it continues to kill other tasks even if we have the oom-killed task freeing its memory. Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip the tasks which have already passed exit_notify() to ensure a zombie with TIF_MEMDIE set can't block oom-killer. Alternatively we could probably clear TIF_MEMDIE after exit_mmap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c027a474a68065391c8773f6e83ed5412657e369;"exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which frees the memory";no;no;yes 242;MDY6Q29tbWl0MjMyNTI5ODpjMDI3YTQ3NGE2ODA2NTM5MWM4NzczZjZlODNlZDU0MTI2NTdlMzY5;Oleg Nesterov;Linus Torvalds;"oom: task->mm == NULL doesn't mean the memory was freed exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which frees the memory. However select_bad_process() checks ->mm != NULL before TIF_MEMDIE, so it continues to kill other tasks even if we have the oom-killed task freeing its memory. Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip the tasks which have already passed exit_notify() to ensure a zombie with TIF_MEMDIE set can't block oom-killer. Alternatively we could probably clear TIF_MEMDIE after exit_mmap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c027a474a68065391c8773f6e83ed5412657e369;"However select_bad_process() checks ->mm != NULL before TIF_MEMDIE, so it continues to kill other tasks even if we have the oom-killed task freeing its memory";no;yes;yes 242;MDY6Q29tbWl0MjMyNTI5ODpjMDI3YTQ3NGE2ODA2NTM5MWM4NzczZjZlODNlZDU0MTI2NTdlMzY5;Oleg Nesterov;Linus Torvalds;"oom: task->mm == NULL doesn't mean the memory was freed exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which frees the memory. However select_bad_process() checks ->mm != NULL before TIF_MEMDIE, so it continues to kill other tasks even if we have the oom-killed task freeing its memory. Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip the tasks which have already passed exit_notify() to ensure a zombie with TIF_MEMDIE set can't block oom-killer. Alternatively we could probably clear TIF_MEMDIE after exit_mmap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c027a474a68065391c8773f6e83ed5412657e369;"Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip the tasks which have already passed exit_notify() to ensure a zombie with TIF_MEMDIE set can't block oom-killer";yes;yes;no 242;MDY6Q29tbWl0MjMyNTI5ODpjMDI3YTQ3NGE2ODA2NTM5MWM4NzczZjZlODNlZDU0MTI2NTdlMzY5;Oleg Nesterov;Linus Torvalds;"oom: task->mm == NULL doesn't mean the memory was freed exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which frees the memory. However select_bad_process() checks ->mm != NULL before TIF_MEMDIE, so it continues to kill other tasks even if we have the oom-killed task freeing its memory. Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip the tasks which have already passed exit_notify() to ensure a zombie with TIF_MEMDIE set can't block oom-killer. Alternatively we could probably clear TIF_MEMDIE after exit_mmap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c027a474a68065391c8773f6e83ed5412657e369;"Alternatively we could probably clear TIF_MEMDIE after exit_mmap().";yes;no;yes 243;MDY6Q29tbWl0MjMyNTI5ODoxMTIzOTgzNmMwNGI1MGJhODQ1M2VjNThjYTdhN2JkNzE2ZWYwMmMx;David Rientjes;Linus Torvalds;"oom: remove references to old badness() function The badness() function in the oom killer was renamed to oom_badness() in a63d83f427fb (""oom: badness heuristic rewrite"") since it is a globally exported function for clarity. The prototype for the old function still existed in linux/oom.h, so remove it. There are no existing users. Also fixes documentation and comment references to badness() and adjusts them accordingly. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11239836c04b50ba8453ec58ca7a7bd716ef02c1;oom: remove references to old badness() function;yes;no;no 243;MDY6Q29tbWl0MjMyNTI5ODoxMTIzOTgzNmMwNGI1MGJhODQ1M2VjNThjYTdhN2JkNzE2ZWYwMmMx;David Rientjes;Linus Torvalds;"oom: remove references to old badness() function The badness() function in the oom killer was renamed to oom_badness() in a63d83f427fb (""oom: badness heuristic rewrite"") since it is a globally exported function for clarity. The prototype for the old function still existed in linux/oom.h, so remove it. There are no existing users. Also fixes documentation and comment references to badness() and adjusts them accordingly. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11239836c04b50ba8453ec58ca7a7bd716ef02c1;"The badness() function in the oom killer was renamed to oom_badness() in a63d83f427fb (""oom: badness heuristic rewrite"") since it is a globally exported function for clarity";no;no;yes 243;MDY6Q29tbWl0MjMyNTI5ODoxMTIzOTgzNmMwNGI1MGJhODQ1M2VjNThjYTdhN2JkNzE2ZWYwMmMx;David Rientjes;Linus Torvalds;"oom: remove references to old badness() function The badness() function in the oom killer was renamed to oom_badness() in a63d83f427fb (""oom: badness heuristic rewrite"") since it is a globally exported function for clarity. The prototype for the old function still existed in linux/oom.h, so remove it. There are no existing users. Also fixes documentation and comment references to badness() and adjusts them accordingly. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11239836c04b50ba8453ec58ca7a7bd716ef02c1;"The prototype for the old function still existed in linux/oom.h, so remove it";yes;yes;yes 243;MDY6Q29tbWl0MjMyNTI5ODoxMTIzOTgzNmMwNGI1MGJhODQ1M2VjNThjYTdhN2JkNzE2ZWYwMmMx;David Rientjes;Linus Torvalds;"oom: remove references to old badness() function The badness() function in the oom killer was renamed to oom_badness() in a63d83f427fb (""oom: badness heuristic rewrite"") since it is a globally exported function for clarity. The prototype for the old function still existed in linux/oom.h, so remove it. There are no existing users. Also fixes documentation and comment references to badness() and adjusts them accordingly. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11239836c04b50ba8453ec58ca7a7bd716ef02c1; There are no existing users;no;no;yes 243;MDY6Q29tbWl0MjMyNTI5ODoxMTIzOTgzNmMwNGI1MGJhODQ1M2VjNThjYTdhN2JkNzE2ZWYwMmMx;David Rientjes;Linus Torvalds;"oom: remove references to old badness() function The badness() function in the oom killer was renamed to oom_badness() in a63d83f427fb (""oom: badness heuristic rewrite"") since it is a globally exported function for clarity. The prototype for the old function still existed in linux/oom.h, so remove it. There are no existing users. Also fixes documentation and comment references to badness() and adjusts them accordingly. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/11239836c04b50ba8453ec58ca7a7bd716ef02c1;"Also fixes documentation and comment references to badness() and adjusts them accordingly.";yes;yes;no 244;MDY6Q29tbWl0MjMyNTI5ODpkMjExNDJlY2U0MTRjZTEwODhjZmNhZTc2MDY4OWFhNjBkNmZlZTgw;Tejun Heo;Oleg Nesterov;"ptrace: kill task_ptrace() task_ptrace(task) simply dereferences task->ptrace and isn't even used consistently only adding confusion. Kill it and directly access ->ptrace instead. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>";https://api.github.com/repos/torvalds/linux/git/commits/d21142ece414ce1088cfcae760689aa60d6fee80;ptrace: kill task_ptrace();yes;no;no 244;MDY6Q29tbWl0MjMyNTI5ODpkMjExNDJlY2U0MTRjZTEwODhjZmNhZTc2MDY4OWFhNjBkNmZlZTgw;Tejun Heo;Oleg Nesterov;"ptrace: kill task_ptrace() task_ptrace(task) simply dereferences task->ptrace and isn't even used consistently only adding confusion. Kill it and directly access ->ptrace instead. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>";https://api.github.com/repos/torvalds/linux/git/commits/d21142ece414ce1088cfcae760689aa60d6fee80;"task_ptrace(task) simply dereferences task->ptrace and isn't even used consistently only adding confusion";no;yes;yes 244;MDY6Q29tbWl0MjMyNTI5ODpkMjExNDJlY2U0MTRjZTEwODhjZmNhZTc2MDY4OWFhNjBkNmZlZTgw;Tejun Heo;Oleg Nesterov;"ptrace: kill task_ptrace() task_ptrace(task) simply dereferences task->ptrace and isn't even used consistently only adding confusion. Kill it and directly access ->ptrace instead. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>";https://api.github.com/repos/torvalds/linux/git/commits/d21142ece414ce1088cfcae760689aa60d6fee80;" Kill it and directly access ->ptrace instead";yes;yes;no 244;MDY6Q29tbWl0MjMyNTI5ODpkMjExNDJlY2U0MTRjZTEwODhjZmNhZTc2MDY4OWFhNjBkNmZlZTgw;Tejun Heo;Oleg Nesterov;"ptrace: kill task_ptrace() task_ptrace(task) simply dereferences task->ptrace and isn't even used consistently only adding confusion. Kill it and directly access ->ptrace instead. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>";https://api.github.com/repos/torvalds/linux/git/commits/d21142ece414ce1088cfcae760689aa60d6fee80;This doesn't introduce any behavior change.;yes;no;yes 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;oom: replace PF_OOM_ORIGIN with toggling oom_score_adj;yes;no;no 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;"There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty";no;yes;yes 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;" It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages";no;yes;yes 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;"PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks";no;no;yes 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;" We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task";no;yes;yes 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;"This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX";yes;yes;yes 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;"The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN";yes;no;yes 245;MDY6Q29tbWl0MjMyNTI5ODo3Mjc4OGMzODU2MDQ1MjM0MjI1OTIyNDljMTljYmEwMTg3MDIxZTli;David Rientjes;Linus Torvalds;"oom: replace PF_OOM_ORIGIN with toggling oom_score_adj There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/72788c385604523422592249c19cba0187021e9b;" The old value is then reinstated when the process should no longer be considered a high priority for oom killing.";yes;no;yes 246;MDY6Q29tbWl0MjMyNTI5ODpmNzU1YTA0MmQ4MmI1MWI1NGYzYmRkMDg5MGU1ZWE1NmMwZmI2ODA3;KOSAKI Motohiro;Linus Torvalds;"oom: use pte pages in OOM score PTE pages eat up memory just like anything else, but we do not account for them in any way in the OOM scores. They are also _guaranteed_ to get freed up when a process is OOM killed, while RSS is not. Reported-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f755a042d82b51b54f3bdd0890e5ea56c0fb6807;oom: use pte pages in OOM score;yes;no;no 246;MDY6Q29tbWl0MjMyNTI5ODpmNzU1YTA0MmQ4MmI1MWI1NGYzYmRkMDg5MGU1ZWE1NmMwZmI2ODA3;KOSAKI Motohiro;Linus Torvalds;"oom: use pte pages in OOM score PTE pages eat up memory just like anything else, but we do not account for them in any way in the OOM scores. They are also _guaranteed_ to get freed up when a process is OOM killed, while RSS is not. Reported-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f755a042d82b51b54f3bdd0890e5ea56c0fb6807;"PTE pages eat up memory just like anything else, but we do not account for them in any way in the OOM scores";no;yes;yes 246;MDY6Q29tbWl0MjMyNTI5ODpmNzU1YTA0MmQ4MmI1MWI1NGYzYmRkMDg5MGU1ZWE1NmMwZmI2ODA3;KOSAKI Motohiro;Linus Torvalds;"oom: use pte pages in OOM score PTE pages eat up memory just like anything else, but we do not account for them in any way in the OOM scores. They are also _guaranteed_ to get freed up when a process is OOM killed, while RSS is not. Reported-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f755a042d82b51b54f3bdd0890e5ea56c0fb6807;" They are also _guaranteed_ to get freed up when a process is OOM killed, while RSS is not.";no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;oom-kill: remove boost_dying_task_prio();yes;no;no 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;"This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority"")";yes;no;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;"That commit dramatically improved oom killer logic when a fork-bomb occurs";no;no;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817; But I've found that it has nasty corner case;no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;" Now cpu cgroup has strange default RT runtime";no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;" It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all";no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;"If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob";no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;" But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all";no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;" In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob";no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;Eventually, kernel may hang up when oom kill occur;no;yes;yes 247;MDY6Q29tbWl0MjMyNTI5ODozNDFhZWEyYmM0OGJmNjUyNzc3ZmIwMTVjYzJiM2RmYTlhNDUxODE3;KOSAKI Motohiro;Linus Torvalds;"oom-kill: remove boost_dying_task_prio() This is an almost-revert of commit 93b43fa (""oom: give the dying task a higher priority""). That commit dramatically improved oom killer logic when a fork-bomb occurs. But I've found that it has nasty corner case. Now cpu cgroup has strange default RT runtime. It's 0! That said, if a process under cpu cgroup promote RT scheduling class, the process never run at all. If an admin inserts a !RT process into a cpu cgroup by setting rtruntime=0, usually it runs perfectly because a !RT task isn't affected by the rtruntime knob. But if it promotes an RT task via an explicit setscheduler() syscall or an OOM, the task can't run at all. In short, the oom killer doesn't work at all if admins are using cpu cgroup and don't touch the rtruntime knob. Eventually, kernel may hang up when oom kill occur. I and the original author Luis agreed to disable this logic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/341aea2bc48bf652777fb015cc2b3dfa9a451817;" I and the original author Luis agreed to disable this logic.";yes;no;no 248;MDY6Q29tbWl0MjMyNTI5ODpiMmI3NTViNWYxMGViMzJmYmRjNzNhOTkwN2MwNzAwNmIxN2Y3MTRi;David Rientjes;Linus Torvalds;"lib, arch: add filter argument to show_mem and fix private implementations Commit ddd588b5dd55 (""oom: suppress nodes that are not allowed from meminfo on oom kill"") moved lib/show_mem.o out of lib/lib.a, which resulted in build warnings on all architectures that implement their own versions of show_mem(): lib/lib.a(show_mem.o): In function `show_mem': show_mem.c:(.text+0x1f4): multiple definition of `show_mem' arch/sparc/mm/built-in.o:(.text+0xd70): first defined here The fix is to remove __show_mem() and add its argument to show_mem() in all implementations to prevent this breakage. Architectures that implement their own show_mem() actually don't do anything with the argument yet, but they could be made to filter nodes that aren't allowed in the current context in the future just like the generic implementation. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b2b755b5f10eb32fbdc73a9907c07006b17f714b;lib, arch: add filter argument to show_mem and fix private implementations;yes;yes;no 248;MDY6Q29tbWl0MjMyNTI5ODpiMmI3NTViNWYxMGViMzJmYmRjNzNhOTkwN2MwNzAwNmIxN2Y3MTRi;David Rientjes;Linus Torvalds;"lib, arch: add filter argument to show_mem and fix private implementations Commit ddd588b5dd55 (""oom: suppress nodes that are not allowed from meminfo on oom kill"") moved lib/show_mem.o out of lib/lib.a, which resulted in build warnings on all architectures that implement their own versions of show_mem(): lib/lib.a(show_mem.o): In function `show_mem': show_mem.c:(.text+0x1f4): multiple definition of `show_mem' arch/sparc/mm/built-in.o:(.text+0xd70): first defined here The fix is to remove __show_mem() and add its argument to show_mem() in all implementations to prevent this breakage. Architectures that implement their own show_mem() actually don't do anything with the argument yet, but they could be made to filter nodes that aren't allowed in the current context in the future just like the generic implementation. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b2b755b5f10eb32fbdc73a9907c07006b17f714b;"Commit ddd588b5dd55 (""oom: suppress nodes that are not allowed from meminfo on oom kill"") moved lib/show_mem.o out of lib/lib.a, which resulted in build warnings on all architectures that implement their own versions of show_mem()";no;yes;yes 248;MDY6Q29tbWl0MjMyNTI5ODpiMmI3NTViNWYxMGViMzJmYmRjNzNhOTkwN2MwNzAwNmIxN2Y3MTRi;David Rientjes;Linus Torvalds;"lib, arch: add filter argument to show_mem and fix private implementations Commit ddd588b5dd55 (""oom: suppress nodes that are not allowed from meminfo on oom kill"") moved lib/show_mem.o out of lib/lib.a, which resulted in build warnings on all architectures that implement their own versions of show_mem(): lib/lib.a(show_mem.o): In function `show_mem': show_mem.c:(.text+0x1f4): multiple definition of `show_mem' arch/sparc/mm/built-in.o:(.text+0xd70): first defined here The fix is to remove __show_mem() and add its argument to show_mem() in all implementations to prevent this breakage. Architectures that implement their own show_mem() actually don't do anything with the argument yet, but they could be made to filter nodes that aren't allowed in the current context in the future just like the generic implementation. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b2b755b5f10eb32fbdc73a9907c07006b17f714b;" show_mem.c:(.text+0x1f4): multiple definition of `show_mem' arch/sparc/mm/built-in.o:(.text+0xd70): first defined here The fix is to remove __show_mem() and add its argument to show_mem() in all implementations to prevent this breakage";yes;yes;yes 248;MDY6Q29tbWl0MjMyNTI5ODpiMmI3NTViNWYxMGViMzJmYmRjNzNhOTkwN2MwNzAwNmIxN2Y3MTRi;David Rientjes;Linus Torvalds;"lib, arch: add filter argument to show_mem and fix private implementations Commit ddd588b5dd55 (""oom: suppress nodes that are not allowed from meminfo on oom kill"") moved lib/show_mem.o out of lib/lib.a, which resulted in build warnings on all architectures that implement their own versions of show_mem(): lib/lib.a(show_mem.o): In function `show_mem': show_mem.c:(.text+0x1f4): multiple definition of `show_mem' arch/sparc/mm/built-in.o:(.text+0xd70): first defined here The fix is to remove __show_mem() and add its argument to show_mem() in all implementations to prevent this breakage. Architectures that implement their own show_mem() actually don't do anything with the argument yet, but they could be made to filter nodes that aren't allowed in the current context in the future just like the generic implementation. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: James Bottomley <James.Bottomley@hansenpartnership.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b2b755b5f10eb32fbdc73a9907c07006b17f714b;"Architectures that implement their own show_mem() actually don't do anything with the argument yet, but they could be made to filter nodes that aren't allowed in the current context in the future just like the generic implementation.";yes;yes;yes 249;MDY6Q29tbWl0MjMyNTI5ODpmOTQzNGFkMTU1MjQyN2ZhYjQ5MzM2ZTFhNmUzZWYxMjE4OTViOWQx;David Rientjes;Linus Torvalds;"memcg: give current access to memory reserves if it's trying to die When a memcg is oom and current has already received a SIGKILL, then give it access to memory reserves with a higher scheduling priority so that it may quickly exit and free its memory. This is identical to the global oom killer and is done even before checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is selected is guaranteed to have come from userspace; the thread only needs access to memory reserves to exit and thus we don't unnecessarily panic the machine until the kernel has no last resort to free memory. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9434ad1552427fab49336e1a6e3ef121895b9d1;memcg: give current access to memory reserves if it's trying to die;yes;no;no 249;MDY6Q29tbWl0MjMyNTI5ODpmOTQzNGFkMTU1MjQyN2ZhYjQ5MzM2ZTFhNmUzZWYxMjE4OTViOWQx;David Rientjes;Linus Torvalds;"memcg: give current access to memory reserves if it's trying to die When a memcg is oom and current has already received a SIGKILL, then give it access to memory reserves with a higher scheduling priority so that it may quickly exit and free its memory. This is identical to the global oom killer and is done even before checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is selected is guaranteed to have come from userspace; the thread only needs access to memory reserves to exit and thus we don't unnecessarily panic the machine until the kernel has no last resort to free memory. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9434ad1552427fab49336e1a6e3ef121895b9d1;"When a memcg is oom and current has already received a SIGKILL, then give it access to memory reserves with a higher scheduling priority so that it may quickly exit and free its memory";yes;yes;no 249;MDY6Q29tbWl0MjMyNTI5ODpmOTQzNGFkMTU1MjQyN2ZhYjQ5MzM2ZTFhNmUzZWYxMjE4OTViOWQx;David Rientjes;Linus Torvalds;"memcg: give current access to memory reserves if it's trying to die When a memcg is oom and current has already received a SIGKILL, then give it access to memory reserves with a higher scheduling priority so that it may quickly exit and free its memory. This is identical to the global oom killer and is done even before checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is selected is guaranteed to have come from userspace; the thread only needs access to memory reserves to exit and thus we don't unnecessarily panic the machine until the kernel has no last resort to free memory. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f9434ad1552427fab49336e1a6e3ef121895b9d1;"This is identical to the global oom killer and is done even before checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is selected is guaranteed to have come from userspace; the thread only needs access to memory reserves to exit and thus we don't unnecessarily panic the machine until the kernel has no last resort to free memory.";yes;yes;yes 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;oom: suppress nodes that are not allowed from meminfo on oom kill;yes;no;no 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;"The oom killer is extremely verbose for machines with a large number of cpus and/or nodes";no;yes;yes 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;" This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state";yes;yes;yes 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;" Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it";yes;yes;yes 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;"This only affects the behavior of dumping memory information when an oom is triggered";yes;no;yes 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;" Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface";yes;no;yes 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;"Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed";yes;yes;yes 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;" This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output";yes;no;no 250;MDY6Q29tbWl0MjMyNTI5ODpkZGQ1ODhiNWRkNTVmMTQzMjAzNzk5NjFlNDc2ODNkYjRlNGMxZDkw;David Rientjes;Linus Torvalds;"oom: suppress nodes that are not allowed from meminfo on oom kill The oom killer is extremely verbose for machines with a large number of cpus and/or nodes. This verbosity can often be harmful if it causes other important messages to be scrolled from the kernel log and incurs a signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT > 8. This patch causes only memory information to be displayed for nodes that are allowed by current's cpuset when dumping the VM state. Information for all other nodes is irrelevant to the oom condition; we don't care if there's an abundance of memory elsewhere if we can't access it. This only affects the behavior of dumping memory information when an oom is triggered. Other dumps, such as for sysrq+m, still display the unfiltered form when using the existing show_mem() interface. Additionally, the per-cpu pageset statistics are extremely verbose in oom killer output, so it is now suppressed. This removes nodes_weight(current->mems_allowed) * (1 + nr_cpus) lines from the oom killer output. Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ddd588b5dd55f14320379961e47683db4e4c1d90;"Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed nodes.";yes;no;yes 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;oom: avoid deferring oom killer if exiting task is being traced;yes;no;no 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;"The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm";no;no;yes 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;" This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed";no;yes;yes 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;" This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all";no;no;yes 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;"The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT";no;yes;yes 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;" The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention";yes;yes;yes 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;"This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path";yes;no;yes 251;MDY6Q29tbWl0MjMyNTI5ODplZGQ0NTU0NGM2ZjA5NTUwZGYwYTU0OTFhYThhMDdhZjI0NzY3ZTcz;David Rientjes;Linus Torvalds;"oom: avoid deferring oom killer if exiting task is being traced The oom killer naturally defers killing anything if it finds an eligible task that is already exiting and has yet to detach its ->mm. This avoids unnecessarily killing tasks when one is already in the exit path and may free enough memory that the oom killer is no longer needed. This is detected by PF_EXITING since threads that have already detached its ->mm are no longer considered at all. The problem with always deferring when a thread is PF_EXITING, however, is that it may never actually exit when being traced, specifically if another task is tracing it with PTRACE_O_TRACEEXIT. The oom killer does not want to defer in this case since there is no guarantee that thread will ever exit without intervention. This patch will now only defer the oom killer when a thread is PF_EXITING and no ptracer has stopped its progress in the exit path. It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/edd45544c6f09550df0a5491aa8a07af24767e73;" It also ensures that a child is sacrificed for the chosen parent only if it has a different ->mm as the comment implies: this ensures that the thread group leader is always targeted appropriately.";yes;yes;yes 252;MDY6Q29tbWl0MjMyNTI5ODozMGUyYjQxZjIwYjYyMzhmNTFlN2NmZmI4NzljN2EwZjAwNzNmNWZl;Andrey Vagin;Linus Torvalds;"oom: skip zombies when iterating tasklist We shouldn't defer oom killing if a thread has already detached its ->mm and still has TIF_MEMDIE set. Memory needs to be freed, so find kill other threads that pin the same ->mm or find another task to kill. Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/30e2b41f20b6238f51e7cffb879c7a0f0073f5fe;oom: skip zombies when iterating tasklist;yes;yes;no 252;MDY6Q29tbWl0MjMyNTI5ODozMGUyYjQxZjIwYjYyMzhmNTFlN2NmZmI4NzljN2EwZjAwNzNmNWZl;Andrey Vagin;Linus Torvalds;"oom: skip zombies when iterating tasklist We shouldn't defer oom killing if a thread has already detached its ->mm and still has TIF_MEMDIE set. Memory needs to be freed, so find kill other threads that pin the same ->mm or find another task to kill. Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/30e2b41f20b6238f51e7cffb879c7a0f0073f5fe;"We shouldn't defer oom killing if a thread has already detached its ->mm and still has TIF_MEMDIE set";no;yes;yes 252;MDY6Q29tbWl0MjMyNTI5ODozMGUyYjQxZjIwYjYyMzhmNTFlN2NmZmI4NzljN2EwZjAwNzNmNWZl;Andrey Vagin;Linus Torvalds;"oom: skip zombies when iterating tasklist We shouldn't defer oom killing if a thread has already detached its ->mm and still has TIF_MEMDIE set. Memory needs to be freed, so find kill other threads that pin the same ->mm or find another task to kill. Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/30e2b41f20b6238f51e7cffb879c7a0f0073f5fe;" Memory needs to be freed, so find kill other threads that pin the same ->mm or find another task to kill.";yes;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;oom: prevent unnecessary oom kills or kernel panics;yes;yes;no 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;"This patch prevents unnecessary oom kills or kernel panics by reverting two commits";yes;yes;no 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;" 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time";yes;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;"It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks";yes;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;" If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes";no;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;" Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible";no;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;"By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves";yes;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;"Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting";no;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;" We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom";yes;yes;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;" If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves";yes;no;yes 253;MDY6Q29tbWl0MjMyNTI5ODozYTVkZGE3YTE3Y2YzNzA2Zjc5Yjg2MjkzZjI5ZGIwMmQ2MWUwZDQ4;David Rientjes;Linus Torvalds;"oom: prevent unnecessary oom kills or kernel panics This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3a5dda7a17cf3706f79b86293f29db02d61e0d48;"Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine.";no;yes;yes 254;MDY6Q29tbWl0MjMyNTI5ODo1MmQzYzAzNjc1ZmRiZTE5NjViOWIxOTA5MDcyYjQwYWQyZjgwMDYz;Linus Torvalds;Linus Torvalds;"Revert ""oom: oom_kill_process: fix the child_points logic"" This reverts the parent commit. I hate doing that, but it's generating some discussion (""half of it is right""), and since I am planning on doing the 2.6.38 release later today we can punt it to stable if required. Let's not rock the boat right now. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/52d3c03675fdbe1965b9b1909072b40ad2f80063;"Revert ""oom: oom_kill_process: fix the child_points logic""";yes;no;no 254;MDY6Q29tbWl0MjMyNTI5ODo1MmQzYzAzNjc1ZmRiZTE5NjViOWIxOTA5MDcyYjQwYWQyZjgwMDYz;Linus Torvalds;Linus Torvalds;"Revert ""oom: oom_kill_process: fix the child_points logic"" This reverts the parent commit. I hate doing that, but it's generating some discussion (""half of it is right""), and since I am planning on doing the 2.6.38 release later today we can punt it to stable if required. Let's not rock the boat right now. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/52d3c03675fdbe1965b9b1909072b40ad2f80063;This reverts the parent commit;yes;no;no 254;MDY6Q29tbWl0MjMyNTI5ODo1MmQzYzAzNjc1ZmRiZTE5NjViOWIxOTA5MDcyYjQwYWQyZjgwMDYz;Linus Torvalds;Linus Torvalds;"Revert ""oom: oom_kill_process: fix the child_points logic"" This reverts the parent commit. I hate doing that, but it's generating some discussion (""half of it is right""), and since I am planning on doing the 2.6.38 release later today we can punt it to stable if required. Let's not rock the boat right now. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/52d3c03675fdbe1965b9b1909072b40ad2f80063;" I hate doing that, but it's generating some discussion (""half of it is right""), and since I am planning on doing the 2.6.38 release later today we can punt it to stable if required";no;yes;no 254;MDY6Q29tbWl0MjMyNTI5ODo1MmQzYzAzNjc1ZmRiZTE5NjViOWIxOTA5MDcyYjQwYWQyZjgwMDYz;Linus Torvalds;Linus Torvalds;"Revert ""oom: oom_kill_process: fix the child_points logic"" This reverts the parent commit. I hate doing that, but it's generating some discussion (""half of it is right""), and since I am planning on doing the 2.6.38 release later today we can punt it to stable if required. Let's not rock the boat right now. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/52d3c03675fdbe1965b9b1909072b40ad2f80063;Let's not rock the boat right now.;yes;yes;no 255;MDY6Q29tbWl0MjMyNTI5ODpkYzFiODNhYjA4ZjE5NTQzMzU2OTJjZGNkNDk5Zjc4Yzk0ZjRjNDJh;Oleg Nesterov;Linus Torvalds;"oom: oom_kill_process: fix the child_points logic oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, ""children has a different mm"" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: ""Kill all processes sharing p->mm"" in oom_kill_task() is wrong too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc1b83ab08f1954335692cdcd499f78c94f4c42a;oom: oom_kill_process: fix the child_points logic;yes;yes;no 255;MDY6Q29tbWl0MjMyNTI5ODpkYzFiODNhYjA4ZjE5NTQzMzU2OTJjZGNkNDk5Zjc4Yzk0ZjRjNDJh;Oleg Nesterov;Linus Torvalds;"oom: oom_kill_process: fix the child_points logic oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, ""children has a different mm"" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: ""Kill all processes sharing p->mm"" in oom_kill_task() is wrong too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc1b83ab08f1954335692cdcd499f78c94f4c42a;oom_kill_process() starts with victim_points == 0;no;no;yes 255;MDY6Q29tbWl0MjMyNTI5ODpkYzFiODNhYjA4ZjE5NTQzMzU2OTJjZGNkNDk5Zjc4Yzk0ZjRjNDJh;Oleg Nesterov;Linus Torvalds;"oom: oom_kill_process: fix the child_points logic oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, ""children has a different mm"" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: ""Kill all processes sharing p->mm"" in oom_kill_task() is wrong too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc1b83ab08f1954335692cdcd499f78c94f4c42a;" This means that (most likely) any child has more points and can be killed erroneously";no;yes;yes 255;MDY6Q29tbWl0MjMyNTI5ODpkYzFiODNhYjA4ZjE5NTQzMzU2OTJjZGNkNDk5Zjc4Yzk0ZjRjNDJh;Oleg Nesterov;Linus Torvalds;"oom: oom_kill_process: fix the child_points logic oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, ""children has a different mm"" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: ""Kill all processes sharing p->mm"" in oom_kill_task() is wrong too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc1b83ab08f1954335692cdcd499f78c94f4c42a;"Also, ""children has a different mm"" doesn't match the reality, we should check child->mm != t->mm";yes;yes;yes 255;MDY6Q29tbWl0MjMyNTI5ODpkYzFiODNhYjA4ZjE5NTQzMzU2OTJjZGNkNDk5Zjc4Yzk0ZjRjNDJh;Oleg Nesterov;Linus Torvalds;"oom: oom_kill_process: fix the child_points logic oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, ""children has a different mm"" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: ""Kill all processes sharing p->mm"" in oom_kill_task() is wrong too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc1b83ab08f1954335692cdcd499f78c94f4c42a;" This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway";yes;yes;yes 255;MDY6Q29tbWl0MjMyNTI5ODpkYzFiODNhYjA4ZjE5NTQzMzU2OTJjZGNkNDk5Zjc4Yzk0ZjRjNDJh;Oleg Nesterov;Linus Torvalds;"oom: oom_kill_process: fix the child_points logic oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, ""children has a different mm"" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: ""Kill all processes sharing p->mm"" in oom_kill_task() is wrong too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dc1b83ab08f1954335692cdcd499f78c94f4c42a;"Note: ""Kill all processes sharing p->mm"" in oom_kill_task() is wrong too.";yes;no;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;oom: kill all threads sharing oom killed task's mm;yes;no;no 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;"It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing";no;yes;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;"This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted";yes;yes;no 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;"It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable";yes;no;no 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;" Thus, we're safe to issue a SIGKILL to any thread sharing the same mm";yes;yes;no 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;"This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed)";no;yes;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;" Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit";yes;yes;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;"This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group";yes;no;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;" Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases";no;yes;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;" We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed";yes;yes;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;"Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose";yes;no;yes 256;MDY6Q29tbWl0MjMyNTI5ODoxZTk5YmFkMGQ5YzEyYTRhYWE2MGNkODEyYzg0ZWYxNTI1NjRiY2Y1;David Rientjes;Linus Torvalds;"oom: kill all threads sharing oom killed task's mm It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1e99bad0d9c12a4aaa60cd812c84ef152564bcf5;"This code had existed in the oom killer for over eight years dating back to the 2.4 kernel.";no;no;yes 257;MDY6Q29tbWl0MjMyNTI5ODplMTg2NDFlMTlhOTIwNGYyNDFmMDRhNWFjNzAwMTY4ZGNkMThkZTRm;David Rientjes;Linus Torvalds;"oom: avoid killing a task if a thread sharing its mm cannot be killed The oom killer's goal is to kill a memory-hogging task so that it may exit, free its memory, and allow the current context to allocate the memory that triggered it in the first place. Thus, killing a task is pointless if other threads sharing its mm cannot be killed because of its /proc/pid/oom_adj or /proc/pid/oom_score_adj value. This patch checks whether any other thread sharing p->mm has an oom_score_adj of OOM_SCORE_ADJ_MIN. If so, the thread cannot be killed and oom_badness(p) returns 0, meaning it's unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e18641e19a9204f241f04a5ac700168dcd18de4f;oom: avoid killing a task if a thread sharing its mm cannot be killed;yes;yes;yes 257;MDY6Q29tbWl0MjMyNTI5ODplMTg2NDFlMTlhOTIwNGYyNDFmMDRhNWFjNzAwMTY4ZGNkMThkZTRm;David Rientjes;Linus Torvalds;"oom: avoid killing a task if a thread sharing its mm cannot be killed The oom killer's goal is to kill a memory-hogging task so that it may exit, free its memory, and allow the current context to allocate the memory that triggered it in the first place. Thus, killing a task is pointless if other threads sharing its mm cannot be killed because of its /proc/pid/oom_adj or /proc/pid/oom_score_adj value. This patch checks whether any other thread sharing p->mm has an oom_score_adj of OOM_SCORE_ADJ_MIN. If so, the thread cannot be killed and oom_badness(p) returns 0, meaning it's unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e18641e19a9204f241f04a5ac700168dcd18de4f;"The oom killer's goal is to kill a memory-hogging task so that it may exit, free its memory, and allow the current context to allocate the memory that triggered it in the first place";no;no;yes 257;MDY6Q29tbWl0MjMyNTI5ODplMTg2NDFlMTlhOTIwNGYyNDFmMDRhNWFjNzAwMTY4ZGNkMThkZTRm;David Rientjes;Linus Torvalds;"oom: avoid killing a task if a thread sharing its mm cannot be killed The oom killer's goal is to kill a memory-hogging task so that it may exit, free its memory, and allow the current context to allocate the memory that triggered it in the first place. Thus, killing a task is pointless if other threads sharing its mm cannot be killed because of its /proc/pid/oom_adj or /proc/pid/oom_score_adj value. This patch checks whether any other thread sharing p->mm has an oom_score_adj of OOM_SCORE_ADJ_MIN. If so, the thread cannot be killed and oom_badness(p) returns 0, meaning it's unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e18641e19a9204f241f04a5ac700168dcd18de4f;" Thus, killing a task is pointless if other threads sharing its mm cannot be killed because of its /proc/pid/oom_adj or /proc/pid/oom_score_adj value";no;yes;yes 257;MDY6Q29tbWl0MjMyNTI5ODplMTg2NDFlMTlhOTIwNGYyNDFmMDRhNWFjNzAwMTY4ZGNkMThkZTRm;David Rientjes;Linus Torvalds;"oom: avoid killing a task if a thread sharing its mm cannot be killed The oom killer's goal is to kill a memory-hogging task so that it may exit, free its memory, and allow the current context to allocate the memory that triggered it in the first place. Thus, killing a task is pointless if other threads sharing its mm cannot be killed because of its /proc/pid/oom_adj or /proc/pid/oom_score_adj value. This patch checks whether any other thread sharing p->mm has an oom_score_adj of OOM_SCORE_ADJ_MIN. If so, the thread cannot be killed and oom_badness(p) returns 0, meaning it's unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e18641e19a9204f241f04a5ac700168dcd18de4f;"This patch checks whether any other thread sharing p->mm has an oom_score_adj of OOM_SCORE_ADJ_MIN";yes;no;yes 257;MDY6Q29tbWl0MjMyNTI5ODplMTg2NDFlMTlhOTIwNGYyNDFmMDRhNWFjNzAwMTY4ZGNkMThkZTRm;David Rientjes;Linus Torvalds;"oom: avoid killing a task if a thread sharing its mm cannot be killed The oom killer's goal is to kill a memory-hogging task so that it may exit, free its memory, and allow the current context to allocate the memory that triggered it in the first place. Thus, killing a task is pointless if other threads sharing its mm cannot be killed because of its /proc/pid/oom_adj or /proc/pid/oom_score_adj value. This patch checks whether any other thread sharing p->mm has an oom_score_adj of OOM_SCORE_ADJ_MIN. If so, the thread cannot be killed and oom_badness(p) returns 0, meaning it's unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e18641e19a9204f241f04a5ac700168dcd18de4f;" If so, the thread cannot be killed and oom_badness(p) returns 0, meaning it's unkillable.";yes;no;yes 258;MDY6Q29tbWl0MjMyNTI5ODplODViZmQzYWE3YTM0ZmE5NjNiYjI2OGE2NzZiNDE2OTRlNmRjZjk2;David Rientjes;Linus Torvalds;"oom: filter unkillable tasks from tasklist dump /proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to limit as much information as possible that it should emit. The tasklist dump should be filtered to only those tasks that are eligible for oom kill. This is already done for memcg ooms, but this patch extends it to both cpuset and mempolicy ooms as well as init. In addition to suppressing irrelevant information, this also reduces confusion since users currently don't know which tasks in the tasklist aren't eligible for kill (such as those attached to cpusets or bound to mempolicies with a disjoint set of mems or nodes, respectively) since that information is not shown. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e85bfd3aa7a34fa963bb268a676b41694e6dcf96;oom: filter unkillable tasks from tasklist dump;yes;no;no 258;MDY6Q29tbWl0MjMyNTI5ODplODViZmQzYWE3YTM0ZmE5NjNiYjI2OGE2NzZiNDE2OTRlNmRjZjk2;David Rientjes;Linus Torvalds;"oom: filter unkillable tasks from tasklist dump /proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to limit as much information as possible that it should emit. The tasklist dump should be filtered to only those tasks that are eligible for oom kill. This is already done for memcg ooms, but this patch extends it to both cpuset and mempolicy ooms as well as init. In addition to suppressing irrelevant information, this also reduces confusion since users currently don't know which tasks in the tasklist aren't eligible for kill (such as those attached to cpusets or bound to mempolicies with a disjoint set of mems or nodes, respectively) since that information is not shown. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e85bfd3aa7a34fa963bb268a676b41694e6dcf96;"/proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to limit as much information as possible that it should emit";no;yes;yes 258;MDY6Q29tbWl0MjMyNTI5ODplODViZmQzYWE3YTM0ZmE5NjNiYjI2OGE2NzZiNDE2OTRlNmRjZjk2;David Rientjes;Linus Torvalds;"oom: filter unkillable tasks from tasklist dump /proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to limit as much information as possible that it should emit. The tasklist dump should be filtered to only those tasks that are eligible for oom kill. This is already done for memcg ooms, but this patch extends it to both cpuset and mempolicy ooms as well as init. In addition to suppressing irrelevant information, this also reduces confusion since users currently don't know which tasks in the tasklist aren't eligible for kill (such as those attached to cpusets or bound to mempolicies with a disjoint set of mems or nodes, respectively) since that information is not shown. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e85bfd3aa7a34fa963bb268a676b41694e6dcf96;"The tasklist dump should be filtered to only those tasks that are eligible for oom kill";yes;yes;no 258;MDY6Q29tbWl0MjMyNTI5ODplODViZmQzYWE3YTM0ZmE5NjNiYjI2OGE2NzZiNDE2OTRlNmRjZjk2;David Rientjes;Linus Torvalds;"oom: filter unkillable tasks from tasklist dump /proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to limit as much information as possible that it should emit. The tasklist dump should be filtered to only those tasks that are eligible for oom kill. This is already done for memcg ooms, but this patch extends it to both cpuset and mempolicy ooms as well as init. In addition to suppressing irrelevant information, this also reduces confusion since users currently don't know which tasks in the tasklist aren't eligible for kill (such as those attached to cpusets or bound to mempolicies with a disjoint set of mems or nodes, respectively) since that information is not shown. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e85bfd3aa7a34fa963bb268a676b41694e6dcf96;" This is already done for memcg ooms, but this patch extends it to both cpuset and mempolicy ooms as well as init";yes;no;yes 258;MDY6Q29tbWl0MjMyNTI5ODplODViZmQzYWE3YTM0ZmE5NjNiYjI2OGE2NzZiNDE2OTRlNmRjZjk2;David Rientjes;Linus Torvalds;"oom: filter unkillable tasks from tasklist dump /proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to limit as much information as possible that it should emit. The tasklist dump should be filtered to only those tasks that are eligible for oom kill. This is already done for memcg ooms, but this patch extends it to both cpuset and mempolicy ooms as well as init. In addition to suppressing irrelevant information, this also reduces confusion since users currently don't know which tasks in the tasklist aren't eligible for kill (such as those attached to cpusets or bound to mempolicies with a disjoint set of mems or nodes, respectively) since that information is not shown. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e85bfd3aa7a34fa963bb268a676b41694e6dcf96;"In addition to suppressing irrelevant information, this also reduces confusion since users currently don't know which tasks in the tasklist aren't eligible for kill (such as those attached to cpusets or bound to mempolicies with a disjoint set of mems or nodes, respectively) since that information is not shown.";yes;yes;yes 259;MDY6Q29tbWl0MjMyNTI5ODpmMTllOGFhMTFhZmEyNDAzNmM2MjczNDI4ZGE1MTk0OWI1YWNmMzBj;David Rientjes;Linus Torvalds;"oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f19e8aa11afa24036c6273428da51949b5acf30c;oom: always return a badness score of non-zero for eligible tasks;yes;no;no 259;MDY6Q29tbWl0MjMyNTI5ODpmMTllOGFhMTFhZmEyNDAzNmM2MjczNDI4ZGE1MTk0OWI1YWNmMzBj;David Rientjes;Linus Torvalds;"oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f19e8aa11afa24036c6273428da51949b5acf30c;"A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity";no;no;yes 259;MDY6Q29tbWl0MjMyNTI5ODpmMTllOGFhMTFhZmEyNDAzNmM2MjczNDI4ZGE1MTk0OWI1YWNmMzBj;David Rientjes;Linus Torvalds;"oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f19e8aa11afa24036c6273428da51949b5acf30c;" The scale ranges from 0 to 1000 with the highest score chosen for kill";no;no;yes 259;MDY6Q29tbWl0MjMyNTI5ODpmMTllOGFhMTFhZmEyNDAzNmM2MjczNDI4ZGE1MTk0OWI1YWNmMzBj;David Rientjes;Linus Torvalds;"oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f19e8aa11afa24036c6273428da51949b5acf30c;" Thus, this scale operates on a resolution of 0.1% of RAM + swap";no;no;yes 259;MDY6Q29tbWl0MjMyNTI5ODpmMTllOGFhMTFhZmEyNDAzNmM2MjczNDI4ZGE1MTk0OWI1YWNmMzBj;David Rientjes;Linus Torvalds;"oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f19e8aa11afa24036c6273428da51949b5acf30c;" Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0";no;no;yes 259;MDY6Q29tbWl0MjMyNTI5ODpmMTllOGFhMTFhZmEyNDAzNmM2MjczNDI4ZGE1MTk0OWI1YWNmMzBj;David Rientjes;Linus Torvalds;"oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f19e8aa11afa24036c6273428da51949b5acf30c;"It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks)";no;yes;yes 259;MDY6Q29tbWl0MjMyNTI5ODpmMTllOGFhMTFhZmEyNDAzNmM2MjczNDI4ZGE1MTk0OWI1YWNmMzBj;David Rientjes;Linus Torvalds;"oom: always return a badness score of non-zero for eligible tasks A task's badness score is roughly a proportion of its rss and swap compared to the system's capacity. The scale ranges from 0 to 1000 with the highest score chosen for kill. Thus, this scale operates on a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources but never have a single task that uses more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f19e8aa11afa24036c6273428da51949b5acf30c;"This patch ensures that the badness score of any eligible task is never 0 so the machine doesn't unnecessarily panic because it cannot find a task to kill.";yes;yes;no 261;MDY6Q29tbWl0MjMyNTI5ODpiNTI3MjNjNTYwN2Y3Njg0YzJjMGMwNzVmODZmODZkYTBkN2ZiNmQw;KOSAKI Motohiro;Linus Torvalds;"oom: fix tasklist_lock leak Commit 0aad4b3124 (""oom: fold __out_of_memory into out_of_memory"") introduced a tasklist_lock leak. Then it caused following obvious danger warnings and panic. ================================================ [ BUG: lock held when returning to user space! ] ------------------------------------------------ rsyslogd/1422 is leaving the kernel with locks still held! 1 lock held by rsyslogd/1422: #0: (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0 BUG: scheduling while atomic: rsyslogd/1422/0x00000002 INFO: lockdep is turned off. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b52723c5607f7684c2c0c075f86f86da0d7fb6d0;oom: fix tasklist_lock leak;yes;yes;no 261;MDY6Q29tbWl0MjMyNTI5ODpiNTI3MjNjNTYwN2Y3Njg0YzJjMGMwNzVmODZmODZkYTBkN2ZiNmQw;KOSAKI Motohiro;Linus Torvalds;"oom: fix tasklist_lock leak Commit 0aad4b3124 (""oom: fold __out_of_memory into out_of_memory"") introduced a tasklist_lock leak. Then it caused following obvious danger warnings and panic. ================================================ [ BUG: lock held when returning to user space! ] ------------------------------------------------ rsyslogd/1422 is leaving the kernel with locks still held! 1 lock held by rsyslogd/1422: #0: (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0 BUG: scheduling while atomic: rsyslogd/1422/0x00000002 INFO: lockdep is turned off. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b52723c5607f7684c2c0c075f86f86da0d7fb6d0;"Commit 0aad4b3124 (""oom: fold __out_of_memory into out_of_memory"") introduced a tasklist_lock leak";no;yes;yes 261;MDY6Q29tbWl0MjMyNTI5ODpiNTI3MjNjNTYwN2Y3Njg0YzJjMGMwNzVmODZmODZkYTBkN2ZiNmQw;KOSAKI Motohiro;Linus Torvalds;"oom: fix tasklist_lock leak Commit 0aad4b3124 (""oom: fold __out_of_memory into out_of_memory"") introduced a tasklist_lock leak. Then it caused following obvious danger warnings and panic. ================================================ [ BUG: lock held when returning to user space! ] ------------------------------------------------ rsyslogd/1422 is leaving the kernel with locks still held! 1 lock held by rsyslogd/1422: #0: (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0 BUG: scheduling while atomic: rsyslogd/1422/0x00000002 INFO: lockdep is turned off. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b52723c5607f7684c2c0c075f86f86da0d7fb6d0;" Then it caused following obvious danger warnings and panic";no;yes;yes 261;MDY6Q29tbWl0MjMyNTI5ODpiNTI3MjNjNTYwN2Y3Njg0YzJjMGMwNzVmODZmODZkYTBkN2ZiNmQw;KOSAKI Motohiro;Linus Torvalds;"oom: fix tasklist_lock leak Commit 0aad4b3124 (""oom: fold __out_of_memory into out_of_memory"") introduced a tasklist_lock leak. Then it caused following obvious danger warnings and panic. ================================================ [ BUG: lock held when returning to user space! ] ------------------------------------------------ rsyslogd/1422 is leaving the kernel with locks still held! 1 lock held by rsyslogd/1422: #0: (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0 BUG: scheduling while atomic: rsyslogd/1422/0x00000002 INFO: lockdep is turned off. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b52723c5607f7684c2c0c075f86f86da0d7fb6d0;This patch fixes it.;yes;yes;no 262;MDY6Q29tbWl0MjMyNTI5ODpiZTcxY2YyMjAyOTcxZTUwY2U0OTUzZDQ3MzY0OWM3MjQ3OTllYjhh;KOSAKI Motohiro;Linus Torvalds;"oom: fix NULL pointer dereference Commit b940fd7035 (""oom: remove unnecessary code and cleanup"") added an unnecessary NULL pointer dereference. remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/be71cf2202971e50ce4953d473649c724799eb8a;oom: fix NULL pointer dereference;yes;yes;no 262;MDY6Q29tbWl0MjMyNTI5ODpiZTcxY2YyMjAyOTcxZTUwY2U0OTUzZDQ3MzY0OWM3MjQ3OTllYjhh;KOSAKI Motohiro;Linus Torvalds;"oom: fix NULL pointer dereference Commit b940fd7035 (""oom: remove unnecessary code and cleanup"") added an unnecessary NULL pointer dereference. remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/be71cf2202971e50ce4953d473649c724799eb8a;"Commit b940fd7035 (""oom: remove unnecessary code and cleanup"") added an unnecessary NULL pointer dereference";no;yes;yes 262;MDY6Q29tbWl0MjMyNTI5ODpiZTcxY2YyMjAyOTcxZTUwY2U0OTUzZDQ3MzY0OWM3MjQ3OTllYjhh;KOSAKI Motohiro;Linus Torvalds;"oom: fix NULL pointer dereference Commit b940fd7035 (""oom: remove unnecessary code and cleanup"") added an unnecessary NULL pointer dereference. remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/be71cf2202971e50ce4953d473649c724799eb8a; remove it.;yes;no;no 263;MDY6Q29tbWl0MjMyNTI5ODoxNThlMGEyZDFiM2NmZmVkOGI0NmNiYzU2MzkzYTEzOTQ2NzJlZjc5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: use find_lock_task_mm() in memory cgroups oom When the OOM killer scans task, it check a task is under memcg or not when it's called via memcg's context. But, as Oleg pointed out, a thread group leader may have NULL ->mm and task_in_mem_cgroup() may do wrong decision. We have to use find_lock_task_mm() in memcg as generic OOM-Killer does. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/158e0a2d1b3cffed8b46cbc56393a1394672ef79;memcg: use find_lock_task_mm() in memory cgroups oom;yes;no;no 263;MDY6Q29tbWl0MjMyNTI5ODoxNThlMGEyZDFiM2NmZmVkOGI0NmNiYzU2MzkzYTEzOTQ2NzJlZjc5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: use find_lock_task_mm() in memory cgroups oom When the OOM killer scans task, it check a task is under memcg or not when it's called via memcg's context. But, as Oleg pointed out, a thread group leader may have NULL ->mm and task_in_mem_cgroup() may do wrong decision. We have to use find_lock_task_mm() in memcg as generic OOM-Killer does. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/158e0a2d1b3cffed8b46cbc56393a1394672ef79;"When the OOM killer scans task, it check a task is under memcg or not when it's called via memcg's context";no;no;yes 263;MDY6Q29tbWl0MjMyNTI5ODoxNThlMGEyZDFiM2NmZmVkOGI0NmNiYzU2MzkzYTEzOTQ2NzJlZjc5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: use find_lock_task_mm() in memory cgroups oom When the OOM killer scans task, it check a task is under memcg or not when it's called via memcg's context. But, as Oleg pointed out, a thread group leader may have NULL ->mm and task_in_mem_cgroup() may do wrong decision. We have to use find_lock_task_mm() in memcg as generic OOM-Killer does. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/158e0a2d1b3cffed8b46cbc56393a1394672ef79;"But, as Oleg pointed out, a thread group leader may have NULL ->mm and task_in_mem_cgroup() may do wrong decision";no;yes;yes 263;MDY6Q29tbWl0MjMyNTI5ODoxNThlMGEyZDFiM2NmZmVkOGI0NmNiYzU2MzkzYTEzOTQ2NzJlZjc5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: use find_lock_task_mm() in memory cgroups oom When the OOM killer scans task, it check a task is under memcg or not when it's called via memcg's context. But, as Oleg pointed out, a thread group leader may have NULL ->mm and task_in_mem_cgroup() may do wrong decision. We have to use find_lock_task_mm() in memcg as generic OOM-Killer does. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/158e0a2d1b3cffed8b46cbc56393a1394672ef79;"We have to use find_lock_task_mm() in memcg as generic OOM-Killer does.";yes;no;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;oom: badness heuristic rewrite;yes;no;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;"This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace";yes;yes;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;"Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits";yes;yes;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task";yes;yes;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;"The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;"The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;"Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task";yes;yes;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;"Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it";yes;yes;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability";yes;yes;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000";yes;no;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered";yes;no;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller";yes;no;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;"/proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa";yes;no;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning";yes;no;no 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity";yes;no;yes 264;MDY6Q29tbWl0MjMyNTI5ODphNjNkODNmNDI3ZmJjZTk3YTZjZWEwZGIyZTY0YjBlYjg0MzVjZDEw;David Rientjes;Linus Torvalds;"oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ""allowable"" memory. ""Allowable,"" in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of ""allowable"" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a63d83f427fbce97a6cea0db2e64b0eb8435cd10;" This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal.";yes;yes;no 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;oom: multi threaded process coredump don't make deadlock;yes;yes;no 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;Oleg pointed out current PF_EXITING check is wrong;no;yes;yes 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;"Because PF_EXITING is per-thread flag, not per-process flag";no;yes;yes 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;"He said, Two threads, group-leader L and its sub-thread T";no;yes;yes 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;T dumps the code;no;yes;yes 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf; In this case both threads have ->mm != NULL, L has PF_EXITING;no;yes;yes 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;" The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter)";no;yes;yes 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf; The second problem is that we should add TIF_MEMDIE to T, not L;no;yes;yes 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;I think we can remove this dubious PF_EXITING check;yes;yes;no 265;MDY6Q29tbWl0MjMyNTI5ODpjZWYxZDM1MjNkMzNlYmMzNWZjMjllNDU0YjFmNGJhYjk1M2ZhYmJm;KOSAKI Motohiro;Linus Torvalds;"oom: multi threaded process coredump don't make deadlock Oleg pointed out current PF_EXITING check is wrong. Because PF_EXITING is per-thread flag, not per-process flag. He said, Two threads, group-leader L and its sub-thread T. T dumps the code. In this case both threads have ->mm != NULL, L has PF_EXITING. The first problem is, select_bad_process() always return -1 in this case (even if the caller is T, this doesn't matter). The second problem is that we should add TIF_MEMDIE to T, not L. I think we can remove this dubious PF_EXITING check. but as first step, This patch add the protection of multi threaded issue. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/cef1d3523d33ebc35fc29e454b1f4bab953fabbf;"but as first step, This patch add the protection of multi threaded issue.";yes;yes;no 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2;oom: give the dying task a higher priority;yes;no;no 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2;"In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die";no;no;yes 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2;"Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory";no;no;yes 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2; That is accomplished by;no;no;yes 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2;"It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory";yes;yes;yes 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2;" It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task";yes;yes;yes 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2;If the dying task is already an RT task, leave it untouched;yes;no;no 266;MDY6Q29tbWl0MjMyNTI5ODo5M2I0M2ZhNTUwODhmZTk3NzUwM2ExNTZkMTA5N2NjMjA1NTQ0OWEy;Luis Claudio R. Goncalves;Linus Torvalds;"oom: give the dying task a higher priority In a system under heavy load it was observed that even after the oom-killer selects a task to die, the task may take a long time to die. Right after sending a SIGKILL to the task selected by the oom-killer this task has its priority increased so that it can exit() soon, freeing memory. That is accomplished by: /* * We give our sacrificial lamb high priority and access to * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); It sounds plausible giving the dying task an even higher priority to be sure it will be scheduled sooner and free the desired memory. It was suggested on LKML using SCHED_FIFO:1, the lowest RT priority so that this task won't interfere with any running RT task. If the dying task is already an RT task, leave it untouched. Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM. Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/93b43fa55088fe977503a156d1097cc2055449a2;" Another good suggestion, implemented here, was to avoid boosting the dying task priority in case of mem_cgroup OOM.";yes;yes;yes 267;MDY6Q29tbWl0MjMyNTI5ODoxOWI0NTg2Y2Q5YzhlZDY0Mjc5ODkwMmU1NWM2ZjYxZWQ1NzZhZDkz;KOSAKI Motohiro;Linus Torvalds;"oom: remove child->mm check from oom_kill_process() The current ""child->mm == p->mm"" check prevents selection of vfork()ed task. But we don't have any reason to don't consider vfork(). Removed. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/19b4586cd9c8ed642798902e55c6f61ed576ad93;oom: remove child->mm check from oom_kill_process();yes;no;no 267;MDY6Q29tbWl0MjMyNTI5ODoxOWI0NTg2Y2Q5YzhlZDY0Mjc5ODkwMmU1NWM2ZjYxZWQ1NzZhZDkz;KOSAKI Motohiro;Linus Torvalds;"oom: remove child->mm check from oom_kill_process() The current ""child->mm == p->mm"" check prevents selection of vfork()ed task. But we don't have any reason to don't consider vfork(). Removed. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/19b4586cd9c8ed642798902e55c6f61ed576ad93;"The current ""child->mm == p->mm"" check prevents selection of vfork()ed task";yes;yes;yes 267;MDY6Q29tbWl0MjMyNTI5ODoxOWI0NTg2Y2Q5YzhlZDY0Mjc5ODkwMmU1NWM2ZjYxZWQ1NzZhZDkz;KOSAKI Motohiro;Linus Torvalds;"oom: remove child->mm check from oom_kill_process() The current ""child->mm == p->mm"" check prevents selection of vfork()ed task. But we don't have any reason to don't consider vfork(). Removed. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/19b4586cd9c8ed642798902e55c6f61ed576ad93; But we don't have any reason to don't consider vfork();yes;yes;yes 267;MDY6Q29tbWl0MjMyNTI5ODoxOWI0NTg2Y2Q5YzhlZDY0Mjc5ODkwMmU1NWM2ZjYxZWQ1NzZhZDkz;KOSAKI Motohiro;Linus Torvalds;"oom: remove child->mm check from oom_kill_process() The current ""child->mm == p->mm"" check prevents selection of vfork()ed task. But we don't have any reason to don't consider vfork(). Removed. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/19b4586cd9c8ed642798902e55c6f61ed576ad93;Removed.;yes;no;no 268;MDY6Q29tbWl0MjMyNTI5ODpkZjEwOTBhOGRkYTQwYjZlMTFkOGNkMDllOGZjOTAwY2ZlOTEzYjM4;KOSAKI Motohiro;Linus Torvalds;"oom: cleanup has_intersects_mems_allowed() presently has_intersects_mems_allowed() has own thread iterate logic, but it should use while_each_thread(). It slightly improve the code readability. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/df1090a8dda40b6e11d8cd09e8fc900cfe913b38;oom: cleanup has_intersects_mems_allowed();yes;yes;no 268;MDY6Q29tbWl0MjMyNTI5ODpkZjEwOTBhOGRkYTQwYjZlMTFkOGNkMDllOGZjOTAwY2ZlOTEzYjM4;KOSAKI Motohiro;Linus Torvalds;"oom: cleanup has_intersects_mems_allowed() presently has_intersects_mems_allowed() has own thread iterate logic, but it should use while_each_thread(). It slightly improve the code readability. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/df1090a8dda40b6e11d8cd09e8fc900cfe913b38;"presently has_intersects_mems_allowed() has own thread iterate logic, but it should use while_each_thread()";yes;yes;yes 268;MDY6Q29tbWl0MjMyNTI5ODpkZjEwOTBhOGRkYTQwYjZlMTFkOGNkMDllOGZjOTAwY2ZlOTEzYjM4;KOSAKI Motohiro;Linus Torvalds;"oom: cleanup has_intersects_mems_allowed() presently has_intersects_mems_allowed() has own thread iterate logic, but it should use while_each_thread(). It slightly improve the code readability. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/df1090a8dda40b6e11d8cd09e8fc900cfe913b38;It slightly improve the code readability.;yes;yes;no 269;MDY6Q29tbWl0MjMyNTI5ODphOTZjZmQ2ZTkxNzZhZDQ0MjIzMzAwMWI3ZDE1ZTllZDQyMjM0MzIw;KOSAKI Motohiro;Linus Torvalds;"oom: move OOM_DISABLE check from oom_kill_task to out_of_memory() Presently if oom_kill_allocating_task is enabled and current have OOM_DISABLED, following printk in oom_kill_process is called twice. pr_err(""%s: Kill process %d (%s) score %lu or sacrifice child\n"", message, task_pid_nr(p), p->comm, points); So, OOM_DISABLE check should be more early. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a96cfd6e9176ad442233001b7d15e9ed42234320;oom: move OOM_DISABLE check from oom_kill_task to out_of_memory();yes;no;no 269;MDY6Q29tbWl0MjMyNTI5ODphOTZjZmQ2ZTkxNzZhZDQ0MjIzMzAwMWI3ZDE1ZTllZDQyMjM0MzIw;KOSAKI Motohiro;Linus Torvalds;"oom: move OOM_DISABLE check from oom_kill_task to out_of_memory() Presently if oom_kill_allocating_task is enabled and current have OOM_DISABLED, following printk in oom_kill_process is called twice. pr_err(""%s: Kill process %d (%s) score %lu or sacrifice child\n"", message, task_pid_nr(p), p->comm, points); So, OOM_DISABLE check should be more early. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a96cfd6e9176ad442233001b7d15e9ed42234320;"Presently if oom_kill_allocating_task is enabled and current have OOM_DISABLED, following printk in oom_kill_process is called twice";no;no;yes 269;MDY6Q29tbWl0MjMyNTI5ODphOTZjZmQ2ZTkxNzZhZDQ0MjIzMzAwMWI3ZDE1ZTllZDQyMjM0MzIw;KOSAKI Motohiro;Linus Torvalds;"oom: move OOM_DISABLE check from oom_kill_task to out_of_memory() Presently if oom_kill_allocating_task is enabled and current have OOM_DISABLED, following printk in oom_kill_process is called twice. pr_err(""%s: Kill process %d (%s) score %lu or sacrifice child\n"", message, task_pid_nr(p), p->comm, points); So, OOM_DISABLE check should be more early. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a96cfd6e9176ad442233001b7d15e9ed42234320;" pr_err(""%s: Kill process %d (%s) score %lu or sacrifice child\n"", So, OOM_DISABLE check should be more early.";no;yes;yes 270;MDY6Q29tbWl0MjMyNTI5ODoxMTNlMjdmMzZkZmY5ODk1MDQ5ZGYzMjRmMjkyNDc0ODU0NzUwZDIx;KOSAKI Motohiro;Linus Torvalds;"oom: kill duplicate OOM_DISABLE check select_bad_process() and badness() have the same OOM_DISABLE check. This patch kills one. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/113e27f36dff9895049df324f292474854750d21;oom: kill duplicate OOM_DISABLE check;yes;yes;no 270;MDY6Q29tbWl0MjMyNTI5ODoxMTNlMjdmMzZkZmY5ODk1MDQ5ZGYzMjRmMjkyNDc0ODU0NzUwZDIx;KOSAKI Motohiro;Linus Torvalds;"oom: kill duplicate OOM_DISABLE check select_bad_process() and badness() have the same OOM_DISABLE check. This patch kills one. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/113e27f36dff9895049df324f292474854750d21;select_bad_process() and badness() have the same OOM_DISABLE check;no;yes;yes 270;MDY6Q29tbWl0MjMyNTI5ODoxMTNlMjdmMzZkZmY5ODk1MDQ5ZGYzMjRmMjkyNDc0ODU0NzUwZDIx;KOSAKI Motohiro;Linus Torvalds;"oom: kill duplicate OOM_DISABLE check select_bad_process() and badness() have the same OOM_DISABLE check. This patch kills one. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/113e27f36dff9895049df324f292474854750d21;" This patch kills one.";yes;no;no 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612;oom: /proc/<pid>/oom_score treat kernel thread honestly;yes;yes;no 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612;If a kernel thread is using use_mm(), badness() returns a positive value;no;no;yes 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612;This is not a big issue because caller take care of it correctly;no;no;yes 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612;" But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process";no;yes;yes 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612;Another example, /proc/1/oom_score return !0 value;no;yes;yes 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612; But it's unkillable;no;yes;yes 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612;This incorrectness makes administration a little confusing;no;yes;no 271;MDY6Q29tbWl0MjMyNTI5ODoyNmViYzk4NDkxM2I2YThkODZkNzI0YjNhNzlkMmVkNGVkNTc0NjEy;KOSAKI Motohiro;Linus Torvalds;"oom: /proc/<pid>/oom_score treat kernel thread honestly If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/26ebc984913b6a8d86d724b3a79d2ed4ed574612;This patch fixes it.;yes;yes;no 272;MDY6Q29tbWl0MjMyNTI5ODpmODhjY2FkNTg4NmQ1YTg2NGI4YjBkNDhjNjY2ZWU5OTk4ZGVjNTNm;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill_process() needs to check that p is unkillable When oom_kill_allocating_task is enabled, an argument task of oom_kill_process is not selected by select_bad_process(), It's just out_of_memory() caller task. It mean the task can be unkillable. check it first. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f88ccad5886d5a864b8b0d48c666ee9998dec53f;oom: oom_kill_process() needs to check that p is unkillable;yes;yes;no 272;MDY6Q29tbWl0MjMyNTI5ODpmODhjY2FkNTg4NmQ1YTg2NGI4YjBkNDhjNjY2ZWU5OTk4ZGVjNTNm;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill_process() needs to check that p is unkillable When oom_kill_allocating_task is enabled, an argument task of oom_kill_process is not selected by select_bad_process(), It's just out_of_memory() caller task. It mean the task can be unkillable. check it first. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f88ccad5886d5a864b8b0d48c666ee9998dec53f;"When oom_kill_allocating_task is enabled, an argument task of oom_kill_process is not selected by select_bad_process(), It's just out_of_memory() caller task";no;no;yes 272;MDY6Q29tbWl0MjMyNTI5ODpmODhjY2FkNTg4NmQ1YTg2NGI4YjBkNDhjNjY2ZWU5OTk4ZGVjNTNm;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill_process() needs to check that p is unkillable When oom_kill_allocating_task is enabled, an argument task of oom_kill_process is not selected by select_bad_process(), It's just out_of_memory() caller task. It mean the task can be unkillable. check it first. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f88ccad5886d5a864b8b0d48c666ee9998dec53f; It mean the task can be unkillable;no;yes;yes 272;MDY6Q29tbWl0MjMyNTI5ODpmODhjY2FkNTg4NmQ1YTg2NGI4YjBkNDhjNjY2ZWU5OTk4ZGVjNTNm;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill_process() needs to check that p is unkillable When oom_kill_allocating_task is enabled, an argument task of oom_kill_process is not selected by select_bad_process(), It's just out_of_memory() caller task. It mean the task can be unkillable. check it first. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f88ccad5886d5a864b8b0d48c666ee9998dec53f;" check it first.";yes;yes;no 273;MDY6Q29tbWl0MjMyNTI5ODphYjI5MGFkYmFmOGY0Njc3MGYwMTRlYTg3OTY4ZGU1YmFjYTI5YzMw;KOSAKI Motohiro;Linus Torvalds;"oom: make oom_unkillable_task() helper function Presently we have the same task check in two places. Unify it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ab290adbaf8f46770f014ea87968de5baca29c30;oom: make oom_unkillable_task() helper function;yes;no;no 273;MDY6Q29tbWl0MjMyNTI5ODphYjI5MGFkYmFmOGY0Njc3MGYwMTRlYTg3OTY4ZGU1YmFjYTI5YzMw;KOSAKI Motohiro;Linus Torvalds;"oom: make oom_unkillable_task() helper function Presently we have the same task check in two places. Unify it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ab290adbaf8f46770f014ea87968de5baca29c30;Presently we have the same task check in two places;no;yes;yes 273;MDY6Q29tbWl0MjMyNTI5ODphYjI5MGFkYmFmOGY0Njc3MGYwMTRlYTg3OTY4ZGU1YmFjYTI5YzMw;KOSAKI Motohiro;Linus Torvalds;"oom: make oom_unkillable_task() helper function Presently we have the same task check in two places. Unify it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ab290adbaf8f46770f014ea87968de5baca29c30;Unify it.;yes;yes;no 274;MDY6Q29tbWl0MjMyNTI5ODoyYzVlYTUzY2U0NmViYjIzMmUwZDlhNDc1ZmRkMmIxNjZkMmE1MTZi;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill_process() doesn't select kthread child Presently select_bad_process() has a PF_KTHREAD check, but oom_kill_process doesn't. It mean oom_kill_process() may choose wrong task, especially, when the child are using use_mm(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2c5ea53ce46ebb232e0d9a475fdd2b166d2a516b;oom: oom_kill_process() doesn't select kthread child;yes;no;yes 274;MDY6Q29tbWl0MjMyNTI5ODoyYzVlYTUzY2U0NmViYjIzMmUwZDlhNDc1ZmRkMmIxNjZkMmE1MTZi;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill_process() doesn't select kthread child Presently select_bad_process() has a PF_KTHREAD check, but oom_kill_process doesn't. It mean oom_kill_process() may choose wrong task, especially, when the child are using use_mm(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2c5ea53ce46ebb232e0d9a475fdd2b166d2a516b;"Presently select_bad_process() has a PF_KTHREAD check, but oom_kill_process doesn't";no;no;yes 274;MDY6Q29tbWl0MjMyNTI5ODoyYzVlYTUzY2U0NmViYjIzMmUwZDlhNDc1ZmRkMmIxNjZkMmE1MTZi;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill_process() doesn't select kthread child Presently select_bad_process() has a PF_KTHREAD check, but oom_kill_process doesn't. It mean oom_kill_process() may choose wrong task, especially, when the child are using use_mm(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2c5ea53ce46ebb232e0d9a475fdd2b166d2a516b;" It mean oom_kill_process() may choose wrong task, especially, when the child are using use_mm().";no;yes;yes 275;MDY6Q29tbWl0MjMyNTI5ODo3YzU5YWVjODMwYzdlZDZjNzQ1YmQ1MTM5ODJjZWUzNTYzZWQyMGMx;KOSAKI Motohiro;Linus Torvalds;"oom: don't try to kill oom_unkillable child Presently, badness() doesn't care about either CPUSET nor mempolicy. Then if the victim child process have disjoint nodemask, OOM Killer might kill innocent process. This patch fixes it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c59aec830c7ed6c745bd513982cee3563ed20c1;oom: don't try to kill oom_unkillable child;yes;yes;no 275;MDY6Q29tbWl0MjMyNTI5ODo3YzU5YWVjODMwYzdlZDZjNzQ1YmQ1MTM5ODJjZWUzNTYzZWQyMGMx;KOSAKI Motohiro;Linus Torvalds;"oom: don't try to kill oom_unkillable child Presently, badness() doesn't care about either CPUSET nor mempolicy. Then if the victim child process have disjoint nodemask, OOM Killer might kill innocent process. This patch fixes it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c59aec830c7ed6c745bd513982cee3563ed20c1;Presently, badness() doesn't care about either CPUSET nor mempolicy;no;no;yes 275;MDY6Q29tbWl0MjMyNTI5ODo3YzU5YWVjODMwYzdlZDZjNzQ1YmQ1MTM5ODJjZWUzNTYzZWQyMGMx;KOSAKI Motohiro;Linus Torvalds;"oom: don't try to kill oom_unkillable child Presently, badness() doesn't care about either CPUSET nor mempolicy. Then if the victim child process have disjoint nodemask, OOM Killer might kill innocent process. This patch fixes it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c59aec830c7ed6c745bd513982cee3563ed20c1;" Then if the victim child process have disjoint nodemask, OOM Killer might kill innocent process";no;yes;yes 275;MDY6Q29tbWl0MjMyNTI5ODo3YzU5YWVjODMwYzdlZDZjNzQ1YmQ1MTM5ODJjZWUzNTYzZWQyMGMx;KOSAKI Motohiro;Linus Torvalds;"oom: don't try to kill oom_unkillable child Presently, badness() doesn't care about either CPUSET nor mempolicy. Then if the victim child process have disjoint nodemask, OOM Killer might kill innocent process. This patch fixes it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7c59aec830c7ed6c745bd513982cee3563ed20c1;This patch fixes it.;yes;yes;no 276;MDY6Q29tbWl0MjMyNTI5ODowYWFkNGIzMTI0ODUwZTg1ZmU1NGU2MTA4MDJmMDkxN2NlNDZhMWFl;David Rientjes;Linus Torvalds;"oom: fold __out_of_memory into out_of_memory __out_of_memory() only has a single caller, so fold it into out_of_memory() and add a comment about locking for its call to oom_kill_process(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0aad4b3124850e85fe54e610802f0917ce46a1ae;oom: fold __out_of_memory into out_of_memory;yes;no;no 276;MDY6Q29tbWl0MjMyNTI5ODowYWFkNGIzMTI0ODUwZTg1ZmU1NGU2MTA4MDJmMDkxN2NlNDZhMWFl;David Rientjes;Linus Torvalds;"oom: fold __out_of_memory into out_of_memory __out_of_memory() only has a single caller, so fold it into out_of_memory() and add a comment about locking for its call to oom_kill_process(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0aad4b3124850e85fe54e610802f0917ce46a1ae;"__out_of_memory() only has a single caller, so fold it into out_of_memory() and add a comment about locking for its call to oom_kill_process().";yes;yes;yes 277;MDY6Q29tbWl0MjMyNTI5ODpmNDQyMDAzMjBiMTBjNzYwMDMxMDFkZWUyMWM1Zjk2MWU4MGZhZjBi;David Rientjes;Linus Torvalds;"oom: remove constraint argument from select_bad_process and __out_of_memory select_bad_process() and __out_of_memory() doe not need their enum oom_constraint arguments: it's possible to pass a NULL nodemask if constraint == CONSTRAINT_MEMORY_POLICY in the caller, out_of_memory(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44200320b10c76003101dee21c5f961e80faf0b;oom: remove constraint argument from select_bad_process and __out_of_memory;yes;no;no 277;MDY6Q29tbWl0MjMyNTI5ODpmNDQyMDAzMjBiMTBjNzYwMDMxMDFkZWUyMWM1Zjk2MWU4MGZhZjBi;David Rientjes;Linus Torvalds;"oom: remove constraint argument from select_bad_process and __out_of_memory select_bad_process() and __out_of_memory() doe not need their enum oom_constraint arguments: it's possible to pass a NULL nodemask if constraint == CONSTRAINT_MEMORY_POLICY in the caller, out_of_memory(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/f44200320b10c76003101dee21c5f961e80faf0b;"select_bad_process() and __out_of_memory() doe not need their enum oom_constraint arguments: it's possible to pass a NULL nodemask if constraint == CONSTRAINT_MEMORY_POLICY in the caller, out_of_memory().";no;yes;yes 278;MDY6Q29tbWl0MjMyNTI5ODpmZjMyMWZlYWMyMjMxM2NmNTNmZmNlYjY5MjI0YjA5YWMxOWZmMjJi;Minchan Kim;Linus Torvalds;"mm: rename try_set_zone_oom() to try_set_zonelist_oom() We have been used naming try_set_zone_oom and clear_zonelist_oom. The role of functions is to lock of zonelist for preventing parallel OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather awkward and unmatched with clear_zonelist_oom. Let's change it with try_set_zonelist_oom. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff321feac22313cf53ffceb69224b09ac19ff22b;mm: rename try_set_zone_oom() to try_set_zonelist_oom();yes;no;no 278;MDY6Q29tbWl0MjMyNTI5ODpmZjMyMWZlYWMyMjMxM2NmNTNmZmNlYjY5MjI0YjA5YWMxOWZmMjJi;Minchan Kim;Linus Torvalds;"mm: rename try_set_zone_oom() to try_set_zonelist_oom() We have been used naming try_set_zone_oom and clear_zonelist_oom. The role of functions is to lock of zonelist for preventing parallel OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather awkward and unmatched with clear_zonelist_oom. Let's change it with try_set_zonelist_oom. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff321feac22313cf53ffceb69224b09ac19ff22b;We have been used naming try_set_zone_oom and clear_zonelist_oom;no;yes;yes 278;MDY6Q29tbWl0MjMyNTI5ODpmZjMyMWZlYWMyMjMxM2NmNTNmZmNlYjY5MjI0YjA5YWMxOWZmMjJi;Minchan Kim;Linus Torvalds;"mm: rename try_set_zone_oom() to try_set_zonelist_oom() We have been used naming try_set_zone_oom and clear_zonelist_oom. The role of functions is to lock of zonelist for preventing parallel OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather awkward and unmatched with clear_zonelist_oom. Let's change it with try_set_zonelist_oom. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff321feac22313cf53ffceb69224b09ac19ff22b;"The role of functions is to lock of zonelist for preventing parallel OOM";no;no;yes 278;MDY6Q29tbWl0MjMyNTI5ODpmZjMyMWZlYWMyMjMxM2NmNTNmZmNlYjY5MjI0YjA5YWMxOWZmMjJi;Minchan Kim;Linus Torvalds;"mm: rename try_set_zone_oom() to try_set_zonelist_oom() We have been used naming try_set_zone_oom and clear_zonelist_oom. The role of functions is to lock of zonelist for preventing parallel OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather awkward and unmatched with clear_zonelist_oom. Let's change it with try_set_zonelist_oom. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff321feac22313cf53ffceb69224b09ac19ff22b;"So clear_zonelist_oom makes sense but try_set_zone_oome is rather awkward and unmatched with clear_zonelist_oom";no;yes;yes 278;MDY6Q29tbWl0MjMyNTI5ODpmZjMyMWZlYWMyMjMxM2NmNTNmZmNlYjY5MjI0YjA5YWMxOWZmMjJi;Minchan Kim;Linus Torvalds;"mm: rename try_set_zone_oom() to try_set_zonelist_oom() We have been used naming try_set_zone_oom and clear_zonelist_oom. The role of functions is to lock of zonelist for preventing parallel OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather awkward and unmatched with clear_zonelist_oom. Let's change it with try_set_zonelist_oom. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ff321feac22313cf53ffceb69224b09ac19ff22b;Let's change it with try_set_zonelist_oom.;yes;no;no 279;MDY6Q29tbWl0MjMyNTI5ODpiOTQwZmQ3MDM1NzJmN2Y5ZTVmODk0YzY4MmM5MWMzY2JkODRjMTFl;David Rientjes;Linus Torvalds;"oom: remove unnecessary code and cleanup Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. __oom_kill_task() only has a single caller, so it can be merged into that function at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b940fd703572f7f9e5f894c682c91c3cbd84c11e;oom: remove unnecessary code and cleanup;yes;yes;no 279;MDY6Q29tbWl0MjMyNTI5ODpiOTQwZmQ3MDM1NzJmN2Y5ZTVmODk0YzY4MmM5MWMzY2JkODRjMTFl;David Rientjes;Linus Torvalds;"oom: remove unnecessary code and cleanup Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. __oom_kill_task() only has a single caller, so it can be merged into that function at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b940fd703572f7f9e5f894c682c91c3cbd84c11e;Remove the redundancy in __oom_kill_task() since;yes;yes;no 279;MDY6Q29tbWl0MjMyNTI5ODpiOTQwZmQ3MDM1NzJmN2Y5ZTVmODk0YzY4MmM5MWMzY2JkODRjMTFl;David Rientjes;Linus Torvalds;"oom: remove unnecessary code and cleanup Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. __oom_kill_task() only has a single caller, so it can be merged into that function at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b940fd703572f7f9e5f894c682c91c3cbd84c11e;" - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves";no;yes;yes 279;MDY6Q29tbWl0MjMyNTI5ODpiOTQwZmQ3MDM1NzJmN2Y5ZTVmODk0YzY4MmM5MWMzY2JkODRjMTFl;David Rientjes;Linus Torvalds;"oom: remove unnecessary code and cleanup Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. __oom_kill_task() only has a single caller, so it can be merged into that function at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b940fd703572f7f9e5f894c682c91c3cbd84c11e;"Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice";yes;yes;yes 279;MDY6Q29tbWl0MjMyNTI5ODpiOTQwZmQ3MDM1NzJmN2Y5ZTVmODk0YzY4MmM5MWMzY2JkODRjMTFl;David Rientjes;Linus Torvalds;"oom: remove unnecessary code and cleanup Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. __oom_kill_task() only has a single caller, so it can be merged into that function at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b940fd703572f7f9e5f894c682c91c3cbd84c11e;"__oom_kill_task() only has a single caller, so it can be merged into that function at the same time.";yes;yes;yes 281;MDY6Q29tbWl0MjMyNTI5ODozMDllZDg4MjUwOGNjNDcxMzIwZmY3OTI2NWU3MzQwNzc0ZDY3NDZj;David Rientjes;Linus Torvalds;"oom: extract panic helper function There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. Also fix a call to dump_header() where tasklist_lock is not read- locked, as required. There's no functional change with this patch. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/309ed882508cc471320ff79265e7340774d6746c;oom: extract panic helper function;yes;no;no 281;MDY6Q29tbWl0MjMyNTI5ODozMDllZDg4MjUwOGNjNDcxMzIwZmY3OTI2NWU3MzQwNzc0ZDY3NDZj;David Rientjes;Linus Torvalds;"oom: extract panic helper function There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. Also fix a call to dump_header() where tasklist_lock is not read- locked, as required. There's no functional change with this patch. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/309ed882508cc471320ff79265e7340774d6746c;"There are various points in the oom killer where the kernel must determine whether to panic or not";no;no;yes 281;MDY6Q29tbWl0MjMyNTI5ODozMDllZDg4MjUwOGNjNDcxMzIwZmY3OTI2NWU3MzQwNzc0ZDY3NDZj;David Rientjes;Linus Torvalds;"oom: extract panic helper function There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. Also fix a call to dump_header() where tasklist_lock is not read- locked, as required. There's no functional change with this patch. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/309ed882508cc471320ff79265e7340774d6746c;" It's better to extract this to a helper function to remove all the confusion as to its semantics";yes;yes;no 281;MDY6Q29tbWl0MjMyNTI5ODozMDllZDg4MjUwOGNjNDcxMzIwZmY3OTI2NWU3MzQwNzc0ZDY3NDZj;David Rientjes;Linus Torvalds;"oom: extract panic helper function There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. Also fix a call to dump_header() where tasklist_lock is not read- locked, as required. There's no functional change with this patch. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/309ed882508cc471320ff79265e7340774d6746c;"Also fix a call to dump_header() where tasklist_lock is not read- locked, as required";yes;yes;yes 281;MDY6Q29tbWl0MjMyNTI5ODozMDllZDg4MjUwOGNjNDcxMzIwZmY3OTI2NWU3MzQwNzc0ZDY3NDZj;David Rientjes;Linus Torvalds;"oom: extract panic helper function There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. Also fix a call to dump_header() where tasklist_lock is not read- locked, as required. There's no functional change with this patch. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/309ed882508cc471320ff79265e7340774d6746c;There's no functional change with this patch.;yes;no;no 282;MDY6Q29tbWl0MjMyNTI5ODphZDkxNWM0MzJlY2NiNDgyNDI3YzFiYmQ3N2M3NGU2ZjdiZmU2MGIz;David Rientjes;Linus Torvalds;"oom: enable oom tasklist dump by default The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is very helpful information in diagnosing why a user's task has been killed. It emits useful information such as each eligible thread's memory usage that can determine why the system is oom, so it should be enabled by default. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad915c432eccb482427c1bbd77c74e6f7bfe60b3;oom: enable oom tasklist dump by default;yes;no;no 282;MDY6Q29tbWl0MjMyNTI5ODphZDkxNWM0MzJlY2NiNDgyNDI3YzFiYmQ3N2M3NGU2ZjdiZmU2MGIz;David Rientjes;Linus Torvalds;"oom: enable oom tasklist dump by default The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is very helpful information in diagnosing why a user's task has been killed. It emits useful information such as each eligible thread's memory usage that can determine why the system is oom, so it should be enabled by default. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad915c432eccb482427c1bbd77c74e6f7bfe60b3;"The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is very helpful information in diagnosing why a user's task has been killed";no;yes;yes 282;MDY6Q29tbWl0MjMyNTI5ODphZDkxNWM0MzJlY2NiNDgyNDI3YzFiYmQ3N2M3NGU2ZjdiZmU2MGIz;David Rientjes;Linus Torvalds;"oom: enable oom tasklist dump by default The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is very helpful information in diagnosing why a user's task has been killed. It emits useful information such as each eligible thread's memory usage that can determine why the system is oom, so it should be enabled by default. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ad915c432eccb482427c1bbd77c74e6f7bfe60b3;"It emits useful information such as each eligible thread's memory usage that can determine why the system is oom, so it should be enabled by default.";yes;yes;yes 283;MDY6Q29tbWl0MjMyNTI5ODo2ZjQ4ZDBlYmQ5MDdhZTQxOTM4N2YyN2I2MDJlZTk4ODcwY2ZhN2Ji;David Rientjes;Linus Torvalds;"oom: select task from tasklist for mempolicy ooms The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f48d0ebd907ae419387f27b602ee98870cfa7bb;oom: select task from tasklist for mempolicy ooms;yes;no;no 283;MDY6Q29tbWl0MjMyNTI5ODo2ZjQ4ZDBlYmQ5MDdhZTQxOTM4N2YyN2I2MDJlZTk4ODcwY2ZhN2Ji;David Rientjes;Linus Torvalds;"oom: select task from tasklist for mempolicy ooms The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f48d0ebd907ae419387f27b602ee98870cfa7bb;"The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes";no;no;yes 283;MDY6Q29tbWl0MjMyNTI5ODo2ZjQ4ZDBlYmQ5MDdhZTQxOTM4N2YyN2I2MDJlZTk4ODcwY2ZhN2Ji;David Rientjes;Linus Torvalds;"oom: select task from tasklist for mempolicy ooms The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f48d0ebd907ae419387f27b602ee98870cfa7bb;" There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however";no;yes;yes 283;MDY6Q29tbWl0MjMyNTI5ODo2ZjQ4ZDBlYmQ5MDdhZTQxOTM4N2YyN2I2MDJlZTk4ODcwY2ZhN2Ji;David Rientjes;Linus Torvalds;"oom: select task from tasklist for mempolicy ooms The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f48d0ebd907ae419387f27b602ee98870cfa7bb;"In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score";yes;yes;yes 283;MDY6Q29tbWl0MjMyNTI5ODo2ZjQ4ZDBlYmQ5MDdhZTQxOTM4N2YyN2I2MDJlZTk4ODcwY2ZhN2Ji;David Rientjes;Linus Torvalds;"oom: select task from tasklist for mempolicy ooms The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6f48d0ebd907ae419387f27b602ee98870cfa7bb;" This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios.";yes;yes;yes 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;oom: sacrifice child with highest badness score for parent;yes;no;no 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;"When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead";no;no;yes 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;" Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list";no;yes;yes 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;" Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future";no;yes;yes 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;"Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists";yes;no;no 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;" The worst child is indicated with the highest badness() score";yes;no;yes 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;" This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill";yes;yes;no 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;"Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent";yes;no;yes 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;" Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE";yes;no;yes 284;MDY6Q29tbWl0MjMyNTI5ODo1ZTlkODM0YTBlMGMwNDg1ZGZhNDg3MjgxYWI5NjUwZmMzN2EzYmI1;David Rientjes;Linus Torvalds;"oom: sacrifice child with highest badness score for parent When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5e9d834a0e0c0485dfa487281ab9650fc37a3bb5;" badness() returns 0 for both cases, so the final oom_kill_task() will always succeed.";yes;no;yes 285;MDY6Q29tbWl0MjMyNTI5ODo2Y2Y4NmFjNmYzNmI2Mzg0NTlhOWE2YzI1NzZkNWU2NTVkNDFkNDUx;David Rientjes;Linus Torvalds;"oom: filter tasks not sharing the same cpuset Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6cf86ac6f36b638459a9a6c2576d5e655d41d451;oom: filter tasks not sharing the same cpuset;yes;no;no 285;MDY6Q29tbWl0MjMyNTI5ODo2Y2Y4NmFjNmYzNmI2Mzg0NTlhOWE2YzI1NzZkNWU2NTVkNDFkNDUx;David Rientjes;Linus Torvalds;"oom: filter tasks not sharing the same cpuset Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6cf86ac6f36b638459a9a6c2576d5e655d41d451;"Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill";yes;no;yes 285;MDY6Q29tbWl0MjMyNTI5ODo2Y2Y4NmFjNmYzNmI2Mzg0NTlhOWE2YzI1NzZkNWU2NTVkNDFkNDUx;David Rientjes;Linus Torvalds;"oom: filter tasks not sharing the same cpuset Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6cf86ac6f36b638459a9a6c2576d5e655d41d451;"Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed";no;yes;yes 285;MDY6Q29tbWl0MjMyNTI5ODo2Y2Y4NmFjNmYzNmI2Mzg0NTlhOWE2YzI1NzZkNWU2NTVkNDFkNDUx;David Rientjes;Linus Torvalds;"oom: filter tasks not sharing the same cpuset Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6cf86ac6f36b638459a9a6c2576d5e655d41d451;"Killing tasks outside of current's cpuset rarely would free memory for current anyway";no;yes;yes 285;MDY6Q29tbWl0MjMyNTI5ODo2Y2Y4NmFjNmYzNmI2Mzg0NTlhOWE2YzI1NzZkNWU2NTVkNDFkNDUx;David Rientjes;Linus Torvalds;"oom: filter tasks not sharing the same cpuset Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6cf86ac6f36b638459a9a6c2576d5e655d41d451;" To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown";yes;yes;no 285;MDY6Q29tbWl0MjMyNTI5ODo2Y2Y4NmFjNmYzNmI2Mzg0NTlhOWE2YzI1NzZkNWU2NTVkNDFkNDUx;David Rientjes;Linus Torvalds;"oom: filter tasks not sharing the same cpuset Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6cf86ac6f36b638459a9a6c2576d5e655d41d451; It is better to kill current than another task needlessly.;yes;yes;no 286;MDY6Q29tbWl0MjMyNTI5ODo0MzU4OTk3YWUzOGExOTAxNDk4ZDEyOGQ2NTA4MTE5ZDlmMzE4YjM2;David Rientjes;Linus Torvalds;"oom: avoid sending exiting tasks a SIGKILL It's unnecessary to SIGKILL a task that is already PF_EXITING and can actually cause a NULL pointer dereference of the sighand if it has already been detached. Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4358997ae38a1901498d128d6508119d9f318b36;oom: avoid sending exiting tasks a SIGKILL;yes;no;no 286;MDY6Q29tbWl0MjMyNTI5ODo0MzU4OTk3YWUzOGExOTAxNDk4ZDEyOGQ2NTA4MTE5ZDlmMzE4YjM2;David Rientjes;Linus Torvalds;"oom: avoid sending exiting tasks a SIGKILL It's unnecessary to SIGKILL a task that is already PF_EXITING and can actually cause a NULL pointer dereference of the sighand if it has already been detached. Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4358997ae38a1901498d128d6508119d9f318b36;"It's unnecessary to SIGKILL a task that is already PF_EXITING and can actually cause a NULL pointer dereference of the sighand if it has already been detached";no;yes;yes 286;MDY6Q29tbWl0MjMyNTI5ODo0MzU4OTk3YWUzOGExOTAxNDk4ZDEyOGQ2NTA4MTE5ZDlmMzE4YjM2;David Rientjes;Linus Torvalds;"oom: avoid sending exiting tasks a SIGKILL It's unnecessary to SIGKILL a task that is already PF_EXITING and can actually cause a NULL pointer dereference of the sighand if it has already been detached. Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4358997ae38a1901498d128d6508119d9f318b36;" Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies.";yes;yes;no 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;oom: give current access to memory reserves if it has been killed;yes;no;no 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;"It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped";no;no;yes 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;"The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed";no;no;yes 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;" Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well";no;yes;yes 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;" In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore";no;yes;yes 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;" Thus, the page allocators livelocks";no;yes;yes 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;"When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns";yes;no;no 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;" Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory";yes;yes;no 287;MDY6Q29tbWl0MjMyNTI5ODo3Yjk4YzJlNDAyZWFhMWYyYmVlYzE4YjFiZGUxN2Y3NDk0OGExOWRi;David Rientjes;Linus Torvalds;"oom: give current access to memory reserves if it has been killed It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocators livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b98c2e402eaa1f2beec18b1bde17f74948a19db;" If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL.";yes;no;no 288;MDY6Q29tbWl0MjMyNTI5ODpjODFmYWM1Y2I4YzkyYjhiNDc5NWFjMjUwYTQ2Yzc1MTRkMWZjZTA2;David Rientjes;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too fix When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead. This is the thread that will be targeted by the oom killer, not its mm-less parent. This also allows us to safely dereference task->comm without needing get_task_comm(). While we're here, remove the cast on task_cpu(task) as Andrew suggested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c81fac5cb8c92b8b4795ac250a46c7514d1fce06;oom: dump_tasks use find_lock_task_mm too fix;yes;yes;no 288;MDY6Q29tbWl0MjMyNTI5ODpjODFmYWM1Y2I4YzkyYjhiNDc5NWFjMjUwYTQ2Yzc1MTRkMWZjZTA2;David Rientjes;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too fix When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead. This is the thread that will be targeted by the oom killer, not its mm-less parent. This also allows us to safely dereference task->comm without needing get_task_comm(). While we're here, remove the cast on task_cpu(task) as Andrew suggested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c81fac5cb8c92b8b4795ac250a46c7514d1fce06;"When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead";yes;yes;yes 288;MDY6Q29tbWl0MjMyNTI5ODpjODFmYWM1Y2I4YzkyYjhiNDc5NWFjMjUwYTQ2Yzc1MTRkMWZjZTA2;David Rientjes;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too fix When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead. This is the thread that will be targeted by the oom killer, not its mm-less parent. This also allows us to safely dereference task->comm without needing get_task_comm(). While we're here, remove the cast on task_cpu(task) as Andrew suggested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c81fac5cb8c92b8b4795ac250a46c7514d1fce06;" This is the thread that will be targeted by the oom killer, not its mm-less parent";yes;no;yes 288;MDY6Q29tbWl0MjMyNTI5ODpjODFmYWM1Y2I4YzkyYjhiNDc5NWFjMjUwYTQ2Yzc1MTRkMWZjZTA2;David Rientjes;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too fix When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead. This is the thread that will be targeted by the oom killer, not its mm-less parent. This also allows us to safely dereference task->comm without needing get_task_comm(). While we're here, remove the cast on task_cpu(task) as Andrew suggested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c81fac5cb8c92b8b4795ac250a46c7514d1fce06;"This also allows us to safely dereference task->comm without needing get_task_comm()";yes;yes;no 288;MDY6Q29tbWl0MjMyNTI5ODpjODFmYWM1Y2I4YzkyYjhiNDc5NWFjMjUwYTQ2Yzc1MTRkMWZjZTA2;David Rientjes;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too fix When find_lock_task_mm() returns a thread other than p in dump_tasks(), its name should be displayed instead. This is the thread that will be targeted by the oom killer, not its mm-less parent. This also allows us to safely dereference task->comm without needing get_task_comm(). While we're here, remove the cast on task_cpu(task) as Andrew suggested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c81fac5cb8c92b8b4795ac250a46c7514d1fce06;While we're here, remove the cast on task_cpu(task) as Andrew suggested.;yes;yes;no 289;MDY6Q29tbWl0MjMyNTI5ODo3NGFiN2YxZDNmMjJjY2IwMmY4YjE0ZjFmMjM3NTQxNmIxYWIwYWRi;David Rientjes;Linus Torvalds;"oom: improve commentary in dump_tasks() The comments in dump_tasks() should be updated to be more clear about why tasks are filtered and how they are filtered by its argument. An unnecessary comment concerning a check for is_global_init() is removed since it isn't of importance. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/74ab7f1d3f22ccb02f8b14f1f2375416b1ab0adb;oom: improve commentary in dump_tasks();yes;yes;no 289;MDY6Q29tbWl0MjMyNTI5ODo3NGFiN2YxZDNmMjJjY2IwMmY4YjE0ZjFmMjM3NTQxNmIxYWIwYWRi;David Rientjes;Linus Torvalds;"oom: improve commentary in dump_tasks() The comments in dump_tasks() should be updated to be more clear about why tasks are filtered and how they are filtered by its argument. An unnecessary comment concerning a check for is_global_init() is removed since it isn't of importance. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/74ab7f1d3f22ccb02f8b14f1f2375416b1ab0adb;"The comments in dump_tasks() should be updated to be more clear about why tasks are filtered and how they are filtered by its argument";yes;yes;no 289;MDY6Q29tbWl0MjMyNTI5ODo3NGFiN2YxZDNmMjJjY2IwMmY4YjE0ZjFmMjM3NTQxNmIxYWIwYWRi;David Rientjes;Linus Torvalds;"oom: improve commentary in dump_tasks() The comments in dump_tasks() should be updated to be more clear about why tasks are filtered and how they are filtered by its argument. An unnecessary comment concerning a check for is_global_init() is removed since it isn't of importance. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/74ab7f1d3f22ccb02f8b14f1f2375416b1ab0adb;"An unnecessary comment concerning a check for is_global_init() is removed since it isn't of importance.";yes;yes;no 290;MDY6Q29tbWl0MjMyNTI5ODpjNTVkYjk1Nzg4YTJhNTVhNzdmNWEzY2VkMWU1OTU3ODcxMDQ0MGIy;KOSAKI Motohiro;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too dump_task() should use find_lock_task_mm() too. It is necessary for protecting task-exiting race. dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. This change finds those threads, if they exist, and emit their information along with the rest of the candidate tasks for kill. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c55db95788a2a55a77f5a3ced1e59578710440b2;oom: dump_tasks use find_lock_task_mm too;yes;no;no 290;MDY6Q29tbWl0MjMyNTI5ODpjNTVkYjk1Nzg4YTJhNTVhNzdmNWEzY2VkMWU1OTU3ODcxMDQ0MGIy;KOSAKI Motohiro;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too dump_task() should use find_lock_task_mm() too. It is necessary for protecting task-exiting race. dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. This change finds those threads, if they exist, and emit their information along with the rest of the candidate tasks for kill. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c55db95788a2a55a77f5a3ced1e59578710440b2;dump_task() should use find_lock_task_mm() too;yes;yes;no 290;MDY6Q29tbWl0MjMyNTI5ODpjNTVkYjk1Nzg4YTJhNTVhNzdmNWEzY2VkMWU1OTU3ODcxMDQ0MGIy;KOSAKI Motohiro;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too dump_task() should use find_lock_task_mm() too. It is necessary for protecting task-exiting race. dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. This change finds those threads, if they exist, and emit their information along with the rest of the candidate tasks for kill. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c55db95788a2a55a77f5a3ced1e59578710440b2;"It is necessary for protecting task-exiting race";no;yes;no 290;MDY6Q29tbWl0MjMyNTI5ODpjNTVkYjk1Nzg4YTJhNTVhNzdmNWEzY2VkMWU1OTU3ODcxMDQ0MGIy;KOSAKI Motohiro;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too dump_task() should use find_lock_task_mm() too. It is necessary for protecting task-exiting race. dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. This change finds those threads, if they exist, and emit their information along with the rest of the candidate tasks for kill. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c55db95788a2a55a77f5a3ced1e59578710440b2;"dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed";no;yes;yes 290;MDY6Q29tbWl0MjMyNTI5ODpjNTVkYjk1Nzg4YTJhNTVhNzdmNWEzY2VkMWU1OTU3ODcxMDQ0MGIy;KOSAKI Motohiro;Linus Torvalds;"oom: dump_tasks use find_lock_task_mm too dump_task() should use find_lock_task_mm() too. It is necessary for protecting task-exiting race. dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in the process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. This change finds those threads, if they exist, and emit their information along with the rest of the candidate tasks for kill. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c55db95788a2a55a77f5a3ced1e59578710440b2;" This change finds those threads, if they exist, and emit their information along with the rest of the candidate tasks for kill.";yes;no;no 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;oom: introduce find_lock_task_mm() to fix !mm false positives;yes;yes;no 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;Almost all ->mm == NULL checks in oom_kill.c are wrong;no;yes;yes 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;"The current code assumes that the task without ->mm has already released its memory and ignores the process";no;no;yes 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;" However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm";no;yes;yes 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;"- Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong";yes;yes;no 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;"- Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL";yes;yes;no 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;"- As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet";yes;no;yes 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;" Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0";yes;no;no 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;Note! This patch is not enough, we need more changes;yes;no;no 291;MDY6Q29tbWl0MjMyNTI5ODpkZDhlOGY0MDVjYTM4NmM3Y2U3Y2JiOTk2Y2NkOTg1ZDI4M2IwZTAz;Oleg Nesterov;Linus Torvalds;"oom: introduce find_lock_task_mm() to fix !mm false positives Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the ""if (!p->mm)"" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd8e8f405ca386c7ce7cbb996ccd985d283b0e03;" - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later.";yes;yes;yes 292;MDY6Q29tbWl0MjMyNTI5ODpiNTIyNzk0MDZlNzdiZTcxMWMwNjhmOWE4ZTk3MGVhNjQ3MWUwODlj;Oleg Nesterov;Linus Torvalds;"oom: PF_EXITING check should take mm into account select_bad_process() checks PF_EXITING to detect the task which is going to release its memory, but the logic is very wrong. - a single process P with the dead group leader disables select_bad_process() completely, it will always return ERR_PTR() while P can live forever - if the PF_EXITING task has already released its ->mm it doesn't make sense to expect it is goiing to free more memory (except task_struct/etc) Change the code to ignore the PF_EXITING tasks without ->mm. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b52279406e77be711c068f9a8e970ea6471e089c;oom: PF_EXITING check should take mm into account;yes;yes;no 292;MDY6Q29tbWl0MjMyNTI5ODpiNTIyNzk0MDZlNzdiZTcxMWMwNjhmOWE4ZTk3MGVhNjQ3MWUwODlj;Oleg Nesterov;Linus Torvalds;"oom: PF_EXITING check should take mm into account select_bad_process() checks PF_EXITING to detect the task which is going to release its memory, but the logic is very wrong. - a single process P with the dead group leader disables select_bad_process() completely, it will always return ERR_PTR() while P can live forever - if the PF_EXITING task has already released its ->mm it doesn't make sense to expect it is goiing to free more memory (except task_struct/etc) Change the code to ignore the PF_EXITING tasks without ->mm. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b52279406e77be711c068f9a8e970ea6471e089c;"select_bad_process() checks PF_EXITING to detect the task which is going to release its memory, but the logic is very wrong";no;yes;yes 292;MDY6Q29tbWl0MjMyNTI5ODpiNTIyNzk0MDZlNzdiZTcxMWMwNjhmOWE4ZTk3MGVhNjQ3MWUwODlj;Oleg Nesterov;Linus Torvalds;"oom: PF_EXITING check should take mm into account select_bad_process() checks PF_EXITING to detect the task which is going to release its memory, but the logic is very wrong. - a single process P with the dead group leader disables select_bad_process() completely, it will always return ERR_PTR() while P can live forever - if the PF_EXITING task has already released its ->mm it doesn't make sense to expect it is goiing to free more memory (except task_struct/etc) Change the code to ignore the PF_EXITING tasks without ->mm. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b52279406e77be711c068f9a8e970ea6471e089c;" - a single process P with the dead group leader disables select_bad_process() completely, it will always return ERR_PTR() while P can live forever - if the PF_EXITING task has already released its ->mm it doesn't make sense to expect it is goiing to free more memory (except task_struct/etc) Change the code to ignore the PF_EXITING tasks without ->mm.";yes;yes;yes 293;MDY6Q29tbWl0MjMyNTI5ODo0NTVjMGU1ZmIwM2I2N2ZhNjJiZDEyZTNhYmUzZmE0ODRiOTk2MGM1;Oleg Nesterov;Linus Torvalds;"oom: check PF_KTHREAD instead of !mm to skip kthreads select_bad_process() thinks a kernel thread can't have ->mm != NULL, this is not true due to use_mm(). Change the code to check PF_KTHREAD. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/455c0e5fb03b67fa62bd12e3abe3fa484b9960c5;oom: check PF_KTHREAD instead of !mm to skip kthreads;yes;no;no 293;MDY6Q29tbWl0MjMyNTI5ODo0NTVjMGU1ZmIwM2I2N2ZhNjJiZDEyZTNhYmUzZmE0ODRiOTk2MGM1;Oleg Nesterov;Linus Torvalds;"oom: check PF_KTHREAD instead of !mm to skip kthreads select_bad_process() thinks a kernel thread can't have ->mm != NULL, this is not true due to use_mm(). Change the code to check PF_KTHREAD. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/455c0e5fb03b67fa62bd12e3abe3fa484b9960c5;"select_bad_process() thinks a kernel thread can't have ->mm != NULL, this is not true due to use_mm()";no;yes;yes 293;MDY6Q29tbWl0MjMyNTI5ODo0NTVjMGU1ZmIwM2I2N2ZhNjJiZDEyZTNhYmUzZmE0ODRiOTk2MGM1;Oleg Nesterov;Linus Torvalds;"oom: check PF_KTHREAD instead of !mm to skip kthreads select_bad_process() thinks a kernel thread can't have ->mm != NULL, this is not true due to use_mm(). Change the code to check PF_KTHREAD. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/455c0e5fb03b67fa62bd12e3abe3fa484b9960c5;Change the code to check PF_KTHREAD.;yes;no;no 294;MDY6Q29tbWl0MjMyNTI5ODpkZjY0ZjgxYmIxZTAxY2JlZjk2N2E5NjY0MmRhY2YyMDhhY2I3ZTcy;David Rientjes;Linus Torvalds;"memcg: make oom killer a no-op when no killable task can be found It's pointless to try to kill current if select_bad_process() did not find an eligible task to kill in mem_cgroup_out_of_memory() since it's guaranteed that current is a member of the memcg that is oom and it is, by definition, unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/df64f81bb1e01cbef967a96642dacf208acb7e72;memcg: make oom killer a no-op when no killable task can be found;yes;no;no 294;MDY6Q29tbWl0MjMyNTI5ODpkZjY0ZjgxYmIxZTAxY2JlZjk2N2E5NjY0MmRhY2YyMDhhY2I3ZTcy;David Rientjes;Linus Torvalds;"memcg: make oom killer a no-op when no killable task can be found It's pointless to try to kill current if select_bad_process() did not find an eligible task to kill in mem_cgroup_out_of_memory() since it's guaranteed that current is a member of the memcg that is oom and it is, by definition, unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/df64f81bb1e01cbef967a96642dacf208acb7e72;"It's pointless to try to kill current if select_bad_process() did not find an eligible task to kill in mem_cgroup_out_of_memory() since it's guaranteed that current is a member of the memcg that is oom and it is, by definition, unkillable.";no;yes;yes 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h;yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files";no;no;yes 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies";no;yes;yes 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;percpu.h -> slab.h dependency is about to be removed;no;yes;yes 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" As this conversion needs to touch large number of source files, the following script is used as the basis of conversion";yes;yes;yes 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;The script does the followings;yes;no;yes 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05; only the necessary includes are there;yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"if only gfp is used, gfp.h, if slab is used, slab.h";yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" blocks and try to put the new include such that its order conforms to its surrounding";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order";yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;The conversion was done in the following steps;yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions";yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" The script emitted errors for ~400 files";yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;Each error was manually checked;yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others";yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" This step added inclusions to around 150 files";yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"The script was run again and the output was compared to the edits from #2 to make sure no file was left behind";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;Several build tests were done and a couple of problems were fixed;yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually";yes;no;yes 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" Each slab.h inclusion directive was examined and added manually as necessary";yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;percpu.h was updated not to include slab.h;yes;no;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"Build test were done on the following configurations and failures were fixed";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;" CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq)";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch";yes;yes;no 295;MDY6Q29tbWl0MjMyNTI5ODo1YTBlM2FkNmFmODY2MGJlMjFjYTk4YTk3MWNkMDBmMzMxMzE4YzA1;Tejun Heo;Tejun Heo;"include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>";https://api.github.com/repos/torvalds/linux/git/commits/5a0e3ad6af8660be21ca98a971cd00f331318c05;"If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch.";yes;no;no 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;memcg: fix oom kill behavior;yes;yes;no 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;"In current page-fault code, handle_mm_fault() -> mem_cgroup_charge() -> map page or handle error";no;no;yes 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6; -> check return code;no;no;yes 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;"If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called";no;no;yes 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;" But if it's caused by memcg, OOM should have been already invoked";no;yes;yes 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6;yes;no;yes 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;" That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future";no;no;yes 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;"But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy";no;yes;yes 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;This patch changes memcg's oom logic as;yes;no;no 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;"Something more sophisticated can be added but this pactch does fundamental things";yes;no;no 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;TODO;yes;no;no 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6;" - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom";yes;no;no 296;MDY6Q29tbWl0MjMyNTI5ODo4Njc1NzhjYmNjYjA4OTNjYzE0ZmMyOWM2NzBmNzE4NTgwOWM5MGQ2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/867578cbccb0893cc14fc29c670f7185809c90d6; - more chances for wake up oom waiter (when changing memory limit etc..);yes;no;no 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;memcg: handle panic_on_oom=always case;yes;yes;no 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;"Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....)";no;no;yes 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;" Then, panic_on_oom=2 means painc_on_oom_always";no;no;yes 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;Now, memcg doesn't check panic_on_oom flag;no;yes;yes 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;This patch adds a check;yes;no;no 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;"BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system";no;yes;yes 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;" When a task is killed, the sysytem recovers and there will be few hint to know what happnes";no;yes;yes 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;" In mission critical system, oom should never happen";no;no;yes 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;" Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot";no;yes;yes 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;TODO;yes;no;no 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;" - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented";yes;no;no 297;MDY6Q29tbWl0MjMyNTI5ODpkYWFmMWU2ODg3NGMwNzhhMTVhZTZhZTgyNzc1MTgzOWM0ZDgxNzM5;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: handle panic_on_oom=always case Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/daaf1e68874c078a15ae6ae827751839c4d81739;"Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup.";yes;no;no 298;MDY6Q29tbWl0MjMyNTI5ODpkNTU5ZGIwODZmZjViZTliY2MyNTllNWFhNTBiZjNkODgxZWFmMWQx;KAMEZAWA Hiroyuki;Linus Torvalds;"mm: clean up mm_counter Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d559db086ff5be9bcc259e5aa50bf3d881eaf1d1;mm: clean up mm_counter;yes;yes;no 298;MDY6Q29tbWl0MjMyNTI5ODpkNTU5ZGIwODZmZjViZTliY2MyNTllNWFhNTBiZjNkODgxZWFmMWQx;KAMEZAWA Hiroyuki;Linus Torvalds;"mm: clean up mm_counter Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d559db086ff5be9bcc259e5aa50bf3d881eaf1d1;"Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation";yes;no;yes 298;MDY6Q29tbWl0MjMyNTI5ODpkNTU5ZGIwODZmZjViZTliY2MyNTllNWFhNTBiZjNkODgxZWFmMWQx;KAMEZAWA Hiroyuki;Linus Torvalds;"mm: clean up mm_counter Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d559db086ff5be9bcc259e5aa50bf3d881eaf1d1;"This patch is for reducing patch size in future patch to modify implementation of per-mm counter.";yes;yes;no 299;MDY6Q29tbWl0MjMyNTI5ODo1YTJkNDE5NjFkZDY4MTViODc0YjVjMGFmZWMwYWM5NmNkOTBlZWE0;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom killing a child process in an other cgroup Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a2d41961dd6815b874b5c0afec0ac96cd90eea4;memcg: fix oom killing a child process in an other cgroup;yes;yes;no 299;MDY6Q29tbWl0MjMyNTI5ODo1YTJkNDE5NjFkZDY4MTViODc0YjVjMGFmZWMwYWM5NmNkOTBlZWE0;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom killing a child process in an other cgroup Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a2d41961dd6815b874b5c0afec0ac96cd90eea4;"Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom";no;no;yes 299;MDY6Q29tbWl0MjMyNTI5ODo1YTJkNDE5NjFkZDY4MTViODc0YjVjMGFmZWMwYWM5NmNkOTBlZWE0;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom killing a child process in an other cgroup Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a2d41961dd6815b874b5c0afec0ac96cd90eea4;" Then, it kills victim's child first";no;no;yes 299;MDY6Q29tbWl0MjMyNTI5ODo1YTJkNDE5NjFkZDY4MTViODc0YjVjMGFmZWMwYWM5NmNkOTBlZWE0;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom killing a child process in an other cgroup Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a2d41961dd6815b874b5c0afec0ac96cd90eea4;"It may kill a child in another cgroup and may not be any help for recovery";no;yes;yes 299;MDY6Q29tbWl0MjMyNTI5ODo1YTJkNDE5NjFkZDY4MTViODc0YjVjMGFmZWMwYWM5NmNkOTBlZWE0;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom killing a child process in an other cgroup Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a2d41961dd6815b874b5c0afec0ac96cd90eea4; And it will break the assumption users have;no;yes;yes 299;MDY6Q29tbWl0MjMyNTI5ODo1YTJkNDE5NjFkZDY4MTViODc0YjVjMGFmZWMwYWM5NmNkOTBlZWE0;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: fix oom killing a child process in an other cgroup Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a2d41961dd6815b874b5c0afec0ac96cd90eea4;This patch fixes it.;yes;yes;no 300;MDY6Q29tbWl0MjMyNTI5ODpkMzFmNTZkYmY4YmFmYWFjYjBjNjE3ZjlhNmYxMzc0OThkNWM3YWVk;Daisuke Nishimura;Linus Torvalds;"memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d31f56dbf8bafaacb0c617f9a6f137498d5c7aed;memcg: avoid oom-killing innocent task in case of use_hierarchy;yes;yes;no 300;MDY6Q29tbWl0MjMyNTI5ODpkMzFmNTZkYmY4YmFmYWFjYjBjNjE3ZjlhNmYxMzc0OThkNWM3YWVk;Daisuke Nishimura;Linus Torvalds;"memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d31f56dbf8bafaacb0c617f9a6f137498d5c7aed;"task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to)";no;no;yes 300;MDY6Q29tbWl0MjMyNTI5ODpkMzFmNTZkYmY4YmFmYWFjYjBjNjE3ZjlhNmYxMzc0OThkNWM3YWVk;Daisuke Nishimura;Linus Torvalds;"memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d31f56dbf8bafaacb0c617f9a6f137498d5c7aed;But this check return true(it's false positive) when;no;yes;yes 300;MDY6Q29tbWl0MjMyNTI5ODpkMzFmNTZkYmY4YmFmYWFjYjBjNjE3ZjlhNmYxMzc0OThkNWM3YWVk;Daisuke Nishimura;Linus Torvalds;"memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d31f56dbf8bafaacb0c617f9a6f137498d5c7aed;" <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00";no;yes;yes 300;MDY6Q29tbWl0MjMyNTI5ODpkMzFmNTZkYmY4YmFmYWFjYjBjNjE3ZjlhNmYxMzc0OThkNWM3YWVk;Daisuke Nishimura;Linus Torvalds;"memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d31f56dbf8bafaacb0c617f9a6f137498d5c7aed;" This patch is a fix for this bug";yes;yes;no 300;MDY6Q29tbWl0MjMyNTI5ODpkMzFmNTZkYmY4YmFmYWFjYjBjNjE3ZjlhNmYxMzc0OThkNWM3YWVk;Daisuke Nishimura;Linus Torvalds;"memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d31f56dbf8bafaacb0c617f9a6f137498d5c7aed;" And this patch also fixes the arg for mem_cgroup_print_oom_info()";yes;yes;no 300;MDY6Q29tbWl0MjMyNTI5ODpkMzFmNTZkYmY4YmFmYWFjYjBjNjE3ZjlhNmYxMzc0OThkNWM3YWVk;Daisuke Nishimura;Linus Torvalds;"memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks ""curr->use_hierarchy""(""curr"" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/d31f56dbf8bafaacb0c617f9a6f137498d5c7aed;" We should print information of mem_cgroup which the task being killed, not current, belongs to.";yes;yes;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;oom-kill: fix NUMA constraint check with nodemask;yes;yes;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;"Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement";yes;yes;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;"In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask()";no;no;yes 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e; - mempolicy don't maintain its own private zonelists;no;no;yes 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;" (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong";no;yes;yes 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;"This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY";yes;no;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;" - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset";yes;no;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e; - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE;yes;no;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;" This doesn't change ""current"" behavior";yes;no;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e;"If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself";yes;no;yes 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e; - handle __GFP_NOFAIL+__GFP_THISNODE path;yes;no;no 301;MDY6Q29tbWl0MjMyNTI5ODo0MzY1YTU2NzZmYTNhYTFkNWFlNmM5MGMyMmEwMDQ0ZjA5YmE1ODRl;KAMEZAWA Hiroyuki;Linus Torvalds;"oom-kill: fix NUMA constraint check with nodemask Fix node-oriented allocation handling in oom-kill.c I myself think of this as a bugfix not as an ehnancement. In these days, things are changed as - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask(). - mempolicy don't maintain its own private zonelists. (And cpuset doesn't use nodemask for __alloc_pages_nodemask()) So, current oom-killer's check function is wrong. This patch does - check nodemask, if nodemask && nodemask doesn't cover all node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY. - Scan all zonelist under nodemask, if it hits cpuset's wall this faiulre is from cpuset. And - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE. This doesn't change ""current"" behavior. If callers use __GFP_THISNODE it should handle ""page allocation failure"" by itself. - handle __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME but this gfpmask is not used now. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4365a5676fa3aa1d5ae6c90c22a0044f09ba584e; This is something like a FIXME but this gfpmask is not used now.;yes;no;no 302;MDY6Q29tbWl0MjMyNTI5ODozYjQ3OThjYmMxM2RkOGQxMTUwYWE2Mzc3Zjk3ZjBlMTE0NTBhNjdk;KOSAKI Motohiro;Linus Torvalds;"oom-kill: show virtual size and rss information of the killed process In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help this analysis. To save time for the debugging. example: =================================================================== rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24 Call Trace: [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40 [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0 (snip) 492283 pages non-shared Out of memory: kill process 2341 (memhog) score 527276 or a child Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB =========================================================================== ^ | here [rientjes@google.com: fix race, add pid & comm to message] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3b4798cbc13dd8d1150aa6377f97f0e11450a67d;oom-kill: show virtual size and rss information of the killed process;yes;no;no 302;MDY6Q29tbWl0MjMyNTI5ODozYjQ3OThjYmMxM2RkOGQxMTUwYWE2Mzc3Zjk3ZjBlMTE0NTBhNjdk;KOSAKI Motohiro;Linus Torvalds;"oom-kill: show virtual size and rss information of the killed process In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help this analysis. To save time for the debugging. example: =================================================================== rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24 Call Trace: [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40 [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0 (snip) 492283 pages non-shared Out of memory: kill process 2341 (memhog) score 527276 or a child Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB =========================================================================== ^ | here [rientjes@google.com: fix race, add pid & comm to message] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3b4798cbc13dd8d1150aa6377f97f0e11450a67d;"In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step";no;yes;yes 302;MDY6Q29tbWl0MjMyNTI5ODozYjQ3OThjYmMxM2RkOGQxMTUwYWE2Mzc3Zjk3ZjBlMTE0NTBhNjdk;KOSAKI Motohiro;Linus Torvalds;"oom-kill: show virtual size and rss information of the killed process In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help this analysis. To save time for the debugging. example: =================================================================== rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24 Call Trace: [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40 [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0 (snip) 492283 pages non-shared Out of memory: kill process 2341 (memhog) score 527276 or a child Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB =========================================================================== ^ | here [rientjes@google.com: fix race, add pid & comm to message] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3b4798cbc13dd8d1150aa6377f97f0e11450a67d;" This patch adds vsz and rss information to the oom log to help this analysis";yes;yes;no 302;MDY6Q29tbWl0MjMyNTI5ODozYjQ3OThjYmMxM2RkOGQxMTUwYWE2Mzc3Zjk3ZjBlMTE0NTBhNjdk;KOSAKI Motohiro;Linus Torvalds;"oom-kill: show virtual size and rss information of the killed process In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help this analysis. To save time for the debugging. example: =================================================================== rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24 Call Trace: [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40 [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0 (snip) 492283 pages non-shared Out of memory: kill process 2341 (memhog) score 527276 or a child Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB =========================================================================== ^ | here [rientjes@google.com: fix race, add pid & comm to message] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3b4798cbc13dd8d1150aa6377f97f0e11450a67d;" To save time for the debugging";no;yes;no 302;MDY6Q29tbWl0MjMyNTI5ODozYjQ3OThjYmMxM2RkOGQxMTUwYWE2Mzc3Zjk3ZjBlMTE0NTBhNjdk;KOSAKI Motohiro;Linus Torvalds;"oom-kill: show virtual size and rss information of the killed process In a typical oom analysis scenario, we frequently want to know whether the killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help this analysis. To save time for the debugging. example: =================================================================== rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24 Call Trace: [<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40 [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0 (snip) 492283 pages non-shared Out of memory: kill process 2341 (memhog) score 527276 or a child Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB =========================================================================== ^ | here [rientjes@google.com: fix race, add pid & comm to message] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3b4798cbc13dd8d1150aa6377f97f0e11450a67d;example;no;no;yes 303;MDY6Q29tbWl0MjMyNTI5ODoxYjYwNGQ3NWJiYjZlMjg2MjhjNWE5NWE0MzM0MzI5NzNjMzNkNTgx;David Rientjes;Linus Torvalds;"oom: dump stack and VM state when oom killer panics The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b604d75bbb6e28628c5a95a433432973c33d581;oom: dump stack and VM state when oom killer panics;yes;no;no 303;MDY6Q29tbWl0MjMyNTI5ODoxYjYwNGQ3NWJiYjZlMjg2MjhjNWE5NWE0MzM0MzI5NzNjMzNkNTgx;David Rientjes;Linus Torvalds;"oom: dump stack and VM state when oom killer panics The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b604d75bbb6e28628c5a95a433432973c33d581;"The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill";no;no;yes 303;MDY6Q29tbWl0MjMyNTI5ODoxYjYwNGQ3NWJiYjZlMjg2MjhjNWE5NWE0MzM0MzI5NzNjMzNkNTgx;David Rientjes;Linus Torvalds;"oom: dump stack and VM state when oom killer panics The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b604d75bbb6e28628c5a95a433432973c33d581;"This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found";no;yes;yes 303;MDY6Q29tbWl0MjMyNTI5ODoxYjYwNGQ3NWJiYjZlMjg2MjhjNWE5NWE0MzM0MzI5NzNjMzNkNTgx;David Rientjes;Linus Torvalds;"oom: dump stack and VM state when oom killer panics The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b604d75bbb6e28628c5a95a433432973c33d581;" It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot";no;yes;yes 303;MDY6Q29tbWl0MjMyNTI5ODoxYjYwNGQ3NWJiYjZlMjg2MjhjNWE5NWE0MzM0MzI5NzNjMzNkNTgx;David Rientjes;Linus Torvalds;"oom: dump stack and VM state when oom killer panics The oom killer header, including information such as the allocation order and gfp mask, current's cpuset and memory controller, call trace, and VM state information is currently only shown when the oom killer has selected a task to kill. This information is omitted, however, when the oom killer panics either because of panic_on_oom sysctl settings or when no killable task was found. It is still relevant to know crucial pieces of information such as the allocation order and VM state when diagnosing such issues, especially at boot. This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b604d75bbb6e28628c5a95a433432973c33d581;"This patch displays the oom killer header whenever it panics so that bug reports can include pertinent information to debug the issue, if possible.";yes;yes;no 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2;oom: oom_kill doesn't kill vfork parent (or child);yes;no;no 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2;"Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm";no;no;yes 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2; it mean vfork parent will be killed;no;no;yes 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2;This is definitely incorrect;no;yes;no 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2; another process have another oom_adj;yes;no;yes 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2;" we shouldn't ignore their oom_adj (it might have OOM_DISABLE)";yes;no;yes 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2;following caller hit the minefield;yes;no;yes 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2;" oom_kill_process(current, gfp_mask, order, 0, NULL, Note: force_sig(SIGKILL) send SIGKILL to all thread in the process";yes;no;yes 304;MDY6Q29tbWl0MjMyNTI5ODo4YzVjZDZmM2ExNzIxMDg1NjUyZGEyMDRkNDU0YWY0ZjhiOTJlZGEy;KOSAKI Motohiro;Linus Torvalds;"oom: oom_kill doesn't kill vfork parent (or child) Current oom_kill doesn't only kill the victim process, but also kill all thas shread the same mm. it mean vfork parent will be killed. This is definitely incorrect. another process have another oom_adj. we shouldn't ignore their oom_adj (it might have OOM_DISABLE). following caller hit the minefield. =============================== switch (constraint) { case CONSTRAINT_MEMORY_POLICY: oom_kill_process(current, gfp_mask, order, 0, NULL, ""No available memory (MPOL_BIND)""); break; Note: force_sig(SIGKILL) send SIGKILL to all thread in the process. We don't need to care multi thread in here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/8c5cd6f3a1721085652da204d454af4f8b92eda2;We don't need to care multi thread in here.;yes;yes;yes 305;MDY6Q29tbWl0MjMyNTI5ODo0OTU3ODlhNTFhOTFjYjhjMDE1ZDhkNzdmZWNiYWMxY2FmMjBiMTg2;KOSAKI Motohiro;Linus Torvalds;"oom: make oom_score to per-process value oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/495789a51a91cb8c015d8d77fecbac1caf20b186;oom: make oom_score to per-process value;yes;no;no 305;MDY6Q29tbWl0MjMyNTI5ODo0OTU3ODlhNTFhOTFjYjhjMDE1ZDhkNzdmZWNiYWMxY2FmMjBiMTg2;KOSAKI Motohiro;Linus Torvalds;"oom: make oom_score to per-process value oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/495789a51a91cb8c015d8d77fecbac1caf20b186;oom-killer kills a process, not task;no;yes;yes 305;MDY6Q29tbWl0MjMyNTI5ODo0OTU3ODlhNTFhOTFjYjhjMDE1ZDhkNzdmZWNiYWMxY2FmMjBiMTg2;KOSAKI Motohiro;Linus Torvalds;"oom: make oom_score to per-process value oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/495789a51a91cb8c015d8d77fecbac1caf20b186;" Then oom_score should be calculated as per-process too";yes;yes;no 305;MDY6Q29tbWl0MjMyNTI5ODo0OTU3ODlhNTFhOTFjYjhjMDE1ZDhkNzdmZWNiYWMxY2FmMjBiMTg2;KOSAKI Motohiro;Linus Torvalds;"oom: make oom_score to per-process value oom-killer kills a process, not task. Then oom_score should be calculated as per-process too. it makes consistency more and makes speed up select_bad_process(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/495789a51a91cb8c015d8d77fecbac1caf20b186;" it makes consistency more and makes speed up select_bad_process().";yes;yes;no 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;oom: move oom_adj value from task_struct to signal_struct;yes;no;no 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;Currently, OOM logic callflow is here;no;no;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score";no;no;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE";no;no;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" Thus __out_of_memory() call select_bad_process() again";no;no;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" but select_bad_process() select the same task";no;yes;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298; It mean kernel fall in livelock;no;yes;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;The fact is, select_bad_process() must select killable task;yes;yes;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" otherwise OOM logic go into livelock";no;yes;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;And root cause is, oom_adj shouldn't be per-thread value;no;yes;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" it should be per-process value because OOM-killer kill a process, not thread";yes;yes;yes 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct";yes;yes;no 306;MDY6Q29tbWl0MjMyNTI5ODoyOGI4M2M1MTkzZTdhYjk1MWU0MDIyNTIyNzhmMmNjNzlkYzRkMjk4;KOSAKI Motohiro;Linus Torvalds;"oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/28b83c5193e7ab951e402252278f2cc79dc4d298;" it naturally prevent select_bad_process() choose wrong task.";yes;yes;no 307;MDY6Q29tbWl0MjMyNTI5ODozNTQ1MWJlZWNiZDdjODZjZTMyNDlkNTQzNTk0NTE3YTVmZTlhMGNk;Hugh Dickins;Linus Torvalds;"ksm: unmerge is an origin of OOMs Just as the swapoff system call allocates many pages of RAM to various processes, perhaps triggering OOM, so ""echo 2 >/sys/kernel/mm/ksm/run"" (unmerge) is liable to allocate many pages of RAM to various processes, perhaps triggering OOM; and each is normally run from a modest admin process (swapoff or shell), easily repeated until it succeeds. So treat unmerge_and_remove_all_rmap_items() in the same way that we treat try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both with that, to ask the OOM killer to kill them first, to prevent them from spawning more and more OOM kills. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/35451beecbd7c86ce3249d543594517a5fe9a0cd;ksm: unmerge is an origin of OOMs;yes;no;no 307;MDY6Q29tbWl0MjMyNTI5ODozNTQ1MWJlZWNiZDdjODZjZTMyNDlkNTQzNTk0NTE3YTVmZTlhMGNk;Hugh Dickins;Linus Torvalds;"ksm: unmerge is an origin of OOMs Just as the swapoff system call allocates many pages of RAM to various processes, perhaps triggering OOM, so ""echo 2 >/sys/kernel/mm/ksm/run"" (unmerge) is liable to allocate many pages of RAM to various processes, perhaps triggering OOM; and each is normally run from a modest admin process (swapoff or shell), easily repeated until it succeeds. So treat unmerge_and_remove_all_rmap_items() in the same way that we treat try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both with that, to ask the OOM killer to kill them first, to prevent them from spawning more and more OOM kills. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/35451beecbd7c86ce3249d543594517a5fe9a0cd;"Just as the swapoff system call allocates many pages of RAM to various processes, perhaps triggering OOM, so ""echo 2 >/sys/kernel/mm/ksm/run"" (unmerge) is liable to allocate many pages of RAM to various processes, perhaps triggering OOM; and each is normally run from a modest admin process (swapoff or shell), easily repeated until it succeeds";no;yes;yes 307;MDY6Q29tbWl0MjMyNTI5ODozNTQ1MWJlZWNiZDdjODZjZTMyNDlkNTQzNTk0NTE3YTVmZTlhMGNk;Hugh Dickins;Linus Torvalds;"ksm: unmerge is an origin of OOMs Just as the swapoff system call allocates many pages of RAM to various processes, perhaps triggering OOM, so ""echo 2 >/sys/kernel/mm/ksm/run"" (unmerge) is liable to allocate many pages of RAM to various processes, perhaps triggering OOM; and each is normally run from a modest admin process (swapoff or shell), easily repeated until it succeeds. So treat unmerge_and_remove_all_rmap_items() in the same way that we treat try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both with that, to ask the OOM killer to kill them first, to prevent them from spawning more and more OOM kills. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/35451beecbd7c86ce3249d543594517a5fe9a0cd;"So treat unmerge_and_remove_all_rmap_items() in the same way that we treat try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both with that, to ask the OOM killer to kill them first, to prevent them from spawning more and more OOM kills.";yes;yes;no 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;"mm: revert ""oom: move oom_adj value""";yes;no;no 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;"The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct";no;no;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d; It was a very good first step for sanitize OOM;no;no;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;"However Paul Menage reported the commit makes regression to his job scheduler";no;yes;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d; Current OOM logic can kill OOM_DISABLED process;no;yes;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;Why? His program has the code of similar to the following;no;yes;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;vfork() parent and child are shared the same mm_struct;no;yes;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;" then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent";no;yes;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;" Then, vfork() parent (job scheduler) lost OOM immune and it was killed";no;yes;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;Actually, fork-setting-exec idiom is very frequently used in userland program;no;no;yes 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;We must not break this assumption;yes;yes;no 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;Then, this patch revert commit 2ff05b2b and related commit;yes;no;no 308;MDY6Q29tbWl0MjMyNTI5ODowNzUzYmEwMWUxMjYwMjBiZjBmODE1MDkzNDkwM2I0ODkzNWI2OTdk;KOSAKI Motohiro;Linus Torvalds;"mm: revert ""oom: move oom_adj value"" The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve(""foo-bar-cmd""); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/0753ba01e126020bf0f8150934903b48935b697d;"Reverted commit list - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time)";yes;no;no 309;MDY6Q29tbWl0MjMyNTI5ODo4MTIzNjgxMDIyNmY3MWJkOWZmNzczMjFjOGU4Mjc2ZGFlN2VmYzYx;David Rientjes;Linus Torvalds;"oom: only oom kill exiting tasks with attached memory When a task is chosen for oom kill and is found to be PF_EXITING, __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit. This privilege is unnecessary, however, if the task has already detached its mm. Although its possible for the mm to become detached later since task_lock() is not held, __oom_kill_task() will simply be a no-op in such circumstances. Subsequently, it is no longer necessary to warn about killing mm-less tasks since it is a no-op. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/81236810226f71bd9ff77321c8e8276dae7efc61;oom: only oom kill exiting tasks with attached memory;yes;yes;no 309;MDY6Q29tbWl0MjMyNTI5ODo4MTIzNjgxMDIyNmY3MWJkOWZmNzczMjFjOGU4Mjc2ZGFlN2VmYzYx;David Rientjes;Linus Torvalds;"oom: only oom kill exiting tasks with attached memory When a task is chosen for oom kill and is found to be PF_EXITING, __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit. This privilege is unnecessary, however, if the task has already detached its mm. Although its possible for the mm to become detached later since task_lock() is not held, __oom_kill_task() will simply be a no-op in such circumstances. Subsequently, it is no longer necessary to warn about killing mm-less tasks since it is a no-op. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/81236810226f71bd9ff77321c8e8276dae7efc61;"When a task is chosen for oom kill and is found to be PF_EXITING, __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit";no;no;yes 309;MDY6Q29tbWl0MjMyNTI5ODo4MTIzNjgxMDIyNmY3MWJkOWZmNzczMjFjOGU4Mjc2ZGFlN2VmYzYx;David Rientjes;Linus Torvalds;"oom: only oom kill exiting tasks with attached memory When a task is chosen for oom kill and is found to be PF_EXITING, __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit. This privilege is unnecessary, however, if the task has already detached its mm. Although its possible for the mm to become detached later since task_lock() is not held, __oom_kill_task() will simply be a no-op in such circumstances. Subsequently, it is no longer necessary to warn about killing mm-less tasks since it is a no-op. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/81236810226f71bd9ff77321c8e8276dae7efc61;"This privilege is unnecessary, however, if the task has already detached its mm";no;yes;yes 309;MDY6Q29tbWl0MjMyNTI5ODo4MTIzNjgxMDIyNmY3MWJkOWZmNzczMjFjOGU4Mjc2ZGFlN2VmYzYx;David Rientjes;Linus Torvalds;"oom: only oom kill exiting tasks with attached memory When a task is chosen for oom kill and is found to be PF_EXITING, __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit. This privilege is unnecessary, however, if the task has already detached its mm. Although its possible for the mm to become detached later since task_lock() is not held, __oom_kill_task() will simply be a no-op in such circumstances. Subsequently, it is no longer necessary to warn about killing mm-less tasks since it is a no-op. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/81236810226f71bd9ff77321c8e8276dae7efc61;" Although its possible for the mm to become detached later since task_lock() is not held, __oom_kill_task() will simply be a no-op in such circumstances";yes;yes;yes 309;MDY6Q29tbWl0MjMyNTI5ODo4MTIzNjgxMDIyNmY3MWJkOWZmNzczMjFjOGU4Mjc2ZGFlN2VmYzYx;David Rientjes;Linus Torvalds;"oom: only oom kill exiting tasks with attached memory When a task is chosen for oom kill and is found to be PF_EXITING, __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit. This privilege is unnecessary, however, if the task has already detached its mm. Although its possible for the mm to become detached later since task_lock() is not held, __oom_kill_task() will simply be a no-op in such circumstances. Subsequently, it is no longer necessary to warn about killing mm-less tasks since it is a no-op. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/81236810226f71bd9ff77321c8e8276dae7efc61;"Subsequently, it is no longer necessary to warn about killing mm-less tasks since it is a no-op.";yes;yes;yes 310;MDY6Q29tbWl0MjMyNTI5ODo0ZDhiOTEzNWMzMGNjYmU0NmU2MjFmZWZkODYyOTY5ODE5MDAzZmQ2;David Rientjes;Linus Torvalds;"oom: avoid unnecessary mm locking and scanning for OOM_DISABLE This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score. This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway. Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups). Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d8b9135c30ccbe46e621fefd862969819003fd6;oom: avoid unnecessary mm locking and scanning for OOM_DISABLE;yes;yes;no 310;MDY6Q29tbWl0MjMyNTI5ODo0ZDhiOTEzNWMzMGNjYmU0NmU2MjFmZWZkODYyOTY5ODE5MDAzZmQ2;David Rientjes;Linus Torvalds;"oom: avoid unnecessary mm locking and scanning for OOM_DISABLE This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score. This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway. Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups). Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d8b9135c30ccbe46e621fefd862969819003fd6;"This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once";yes;yes;no 310;MDY6Q29tbWl0MjMyNTI5ODo0ZDhiOTEzNWMzMGNjYmU0NmU2MjFmZWZkODYyOTY5ODE5MDAzZmQ2;David Rientjes;Linus Torvalds;"oom: avoid unnecessary mm locking and scanning for OOM_DISABLE This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score. This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway. Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups). Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d8b9135c30ccbe46e621fefd862969819003fd6;" If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score";yes;yes;yes 310;MDY6Q29tbWl0MjMyNTI5ODo0ZDhiOTEzNWMzMGNjYmU0NmU2MjFmZWZkODYyOTY5ODE5MDAzZmQ2;David Rientjes;Linus Torvalds;"oom: avoid unnecessary mm locking and scanning for OOM_DISABLE This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score. This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway. Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups). Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d8b9135c30ccbe46e621fefd862969819003fd6;"This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway";yes;yes;yes 310;MDY6Q29tbWl0MjMyNTI5ODo0ZDhiOTEzNWMzMGNjYmU0NmU2MjFmZWZkODYyOTY5ODE5MDAzZmQ2;David Rientjes;Linus Torvalds;"oom: avoid unnecessary mm locking and scanning for OOM_DISABLE This moves the check for OOM_DISABLE to the badness heuristic so it is only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the score is 0, which is also correctly exported via /proc/pid/oom_score. This requires that tasks with badness scores of 0 are prohibited from being oom killed, which makes sense since they would not allow for future memory freeing anyway. Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups). Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4d8b9135c30ccbe46e621fefd862969819003fd6;"Since the oom_adj value is a characteristic of an mm and not a task, it is no longer necessary to check the oom_adj value for threads sharing the same memory (except when simply issuing SIGKILLs for threads in other thread groups).";yes;yes;yes 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;oom: move oom_adj value from task_struct to mm_struct;yes;no;no 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;"The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm";no;no;yes 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;" If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing";no;yes;yes 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;"This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct";yes;yes;no 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;" This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic";yes;yes;yes 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;"This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE";yes;yes;no 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;" This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry";yes;yes;yes 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;"Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary";yes;yes;no 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;"Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already";yes;yes;yes 311;MDY6Q29tbWl0MjMyNTI5ODoyZmYwNWIyYjRlYWMyZTYzZDM0NWZjNzMxZWExNTFhMDYwMjQ3ZjUz;David Rientjes;Linus Torvalds;"oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2ff05b2b4eac2e63d345fc731ea151a060247f53;" They simply report an oom_adj value of OOM_DISABLE.";yes;no;yes 312;MDY6Q29tbWl0MjMyNTI5ODo2ZDI2NjFlZGU1ZjIwZjk2ODQyMmU3OTBhZjMzMzQ5MDhjM2JjODU3;David Rientjes;Linus Torvalds;"oom: fix possible oom_dump_tasks NULL pointer When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL pointer for tasks that have detached mm's since task_lock() is not held during the tasklist scan. Add the task_lock(). Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6d2661ede5f20f968422e790af3334908c3bc857;oom: fix possible oom_dump_tasks NULL pointer;yes;yes;no 312;MDY6Q29tbWl0MjMyNTI5ODo2ZDI2NjFlZGU1ZjIwZjk2ODQyMmU3OTBhZjMzMzQ5MDhjM2JjODU3;David Rientjes;Linus Torvalds;"oom: fix possible oom_dump_tasks NULL pointer When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL pointer for tasks that have detached mm's since task_lock() is not held during the tasklist scan. Add the task_lock(). Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6d2661ede5f20f968422e790af3334908c3bc857;"When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL pointer for tasks that have detached mm's since task_lock() is not held during the tasklist scan";no;yes;yes 312;MDY6Q29tbWl0MjMyNTI5ODo2ZDI2NjFlZGU1ZjIwZjk2ODQyMmU3OTBhZjMzMzQ5MDhjM2JjODU3;David Rientjes;Linus Torvalds;"oom: fix possible oom_dump_tasks NULL pointer When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL pointer for tasks that have detached mm's since task_lock() is not held during the tasklist scan. Add the task_lock(). Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/6d2661ede5f20f968422e790af3334908c3bc857; Add the task_lock().;yes;no;no 313;MDY6Q29tbWl0MjMyNTI5ODoxODQxMDFiZjE0M2FjOTZkNjJiM2RjYzE3ZTdiMzU1MGY5OGQzMzUw;David Rientjes;Linus Torvalds;"oom: prevent livelock when oom_kill_allocating_task is set When /proc/sys/vm/oom_kill_allocating_task is set for large systems that want to avoid the lengthy tasklist scan, it's possible to livelock if current is ineligible for oom kill. This normally happens when it is set to OOM_DISABLE, but is also possible if any threads are sharing the same ->mm with a different tgid. So change __out_of_memory() to fall back to the full task-list scan if it was unable to kill `current'. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/184101bf143ac96d62b3dcc17e7b3550f98d3350;oom: prevent livelock when oom_kill_allocating_task is set;yes;yes;no 313;MDY6Q29tbWl0MjMyNTI5ODoxODQxMDFiZjE0M2FjOTZkNjJiM2RjYzE3ZTdiMzU1MGY5OGQzMzUw;David Rientjes;Linus Torvalds;"oom: prevent livelock when oom_kill_allocating_task is set When /proc/sys/vm/oom_kill_allocating_task is set for large systems that want to avoid the lengthy tasklist scan, it's possible to livelock if current is ineligible for oom kill. This normally happens when it is set to OOM_DISABLE, but is also possible if any threads are sharing the same ->mm with a different tgid. So change __out_of_memory() to fall back to the full task-list scan if it was unable to kill `current'. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/184101bf143ac96d62b3dcc17e7b3550f98d3350;"When /proc/sys/vm/oom_kill_allocating_task is set for large systems that want to avoid the lengthy tasklist scan, it's possible to livelock if current is ineligible for oom kill";no;yes;yes 313;MDY6Q29tbWl0MjMyNTI5ODoxODQxMDFiZjE0M2FjOTZkNjJiM2RjYzE3ZTdiMzU1MGY5OGQzMzUw;David Rientjes;Linus Torvalds;"oom: prevent livelock when oom_kill_allocating_task is set When /proc/sys/vm/oom_kill_allocating_task is set for large systems that want to avoid the lengthy tasklist scan, it's possible to livelock if current is ineligible for oom kill. This normally happens when it is set to OOM_DISABLE, but is also possible if any threads are sharing the same ->mm with a different tgid. So change __out_of_memory() to fall back to the full task-list scan if it was unable to kill `current'. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/184101bf143ac96d62b3dcc17e7b3550f98d3350;" This normally happens when it is set to OOM_DISABLE, but is also possible if any threads are sharing the same ->mm with a different tgid";no;no;yes 313;MDY6Q29tbWl0MjMyNTI5ODoxODQxMDFiZjE0M2FjOTZkNjJiM2RjYzE3ZTdiMzU1MGY5OGQzMzUw;David Rientjes;Linus Torvalds;"oom: prevent livelock when oom_kill_allocating_task is set When /proc/sys/vm/oom_kill_allocating_task is set for large systems that want to avoid the lengthy tasklist scan, it's possible to livelock if current is ineligible for oom kill. This normally happens when it is set to OOM_DISABLE, but is also possible if any threads are sharing the same ->mm with a different tgid. So change __out_of_memory() to fall back to the full task-list scan if it was unable to kill `current'. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/184101bf143ac96d62b3dcc17e7b3550f98d3350;"So change __out_of_memory() to fall back to the full task-list scan if it was unable to kill `current'.";yes;no;no 314;MDY6Q29tbWl0MjMyNTI5ODplMjIyNDMyYmZhN2RjZjZlYzAwODYyMmE5NzhjOWYyODRlZDVlM2E5;Balbir Singh;Linus Torvalds;"memcg: show memcg information during OOM Add RSS and swap to OOM output from memcg Display memcg values like failcnt, usage and limit when an OOM occurs due to memcg. Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki, Daisuke Nishimura and KOSAKI Motohiro for review. Sample output ------------- Task in /a/x killed as a result of limit of /a memory: usage 1048576kB, limit 1048576kB, failcnt 4183 memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0 [akpm@linux-foundation.org: compilation fix] [akpm@linux-foundation.org: fix kerneldoc and whitespace] [akpm@linux-foundation.org: add printk facility level] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e222432bfa7dcf6ec008622a978c9f284ed5e3a9;memcg: show memcg information during OOM;yes;no;no 314;MDY6Q29tbWl0MjMyNTI5ODplMjIyNDMyYmZhN2RjZjZlYzAwODYyMmE5NzhjOWYyODRlZDVlM2E5;Balbir Singh;Linus Torvalds;"memcg: show memcg information during OOM Add RSS and swap to OOM output from memcg Display memcg values like failcnt, usage and limit when an OOM occurs due to memcg. Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki, Daisuke Nishimura and KOSAKI Motohiro for review. Sample output ------------- Task in /a/x killed as a result of limit of /a memory: usage 1048576kB, limit 1048576kB, failcnt 4183 memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0 [akpm@linux-foundation.org: compilation fix] [akpm@linux-foundation.org: fix kerneldoc and whitespace] [akpm@linux-foundation.org: add printk facility level] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e222432bfa7dcf6ec008622a978c9f284ed5e3a9;"Add RSS and swap to OOM output from memcg Display memcg values like failcnt, usage and limit when an OOM occurs due to memcg";yes;no;no 314;MDY6Q29tbWl0MjMyNTI5ODplMjIyNDMyYmZhN2RjZjZlYzAwODYyMmE5NzhjOWYyODRlZDVlM2E5;Balbir Singh;Linus Torvalds;"memcg: show memcg information during OOM Add RSS and swap to OOM output from memcg Display memcg values like failcnt, usage and limit when an OOM occurs due to memcg. Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki, Daisuke Nishimura and KOSAKI Motohiro for review. Sample output ------------- Task in /a/x killed as a result of limit of /a memory: usage 1048576kB, limit 1048576kB, failcnt 4183 memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0 [akpm@linux-foundation.org: compilation fix] [akpm@linux-foundation.org: fix kerneldoc and whitespace] [akpm@linux-foundation.org: add printk facility level] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e222432bfa7dcf6ec008622a978c9f284ed5e3a9;"Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki, Daisuke Nishimura and KOSAKI Motohiro for review";no;no;no 314;MDY6Q29tbWl0MjMyNTI5ODplMjIyNDMyYmZhN2RjZjZlYzAwODYyMmE5NzhjOWYyODRlZDVlM2E5;Balbir Singh;Linus Torvalds;"memcg: show memcg information during OOM Add RSS and swap to OOM output from memcg Display memcg values like failcnt, usage and limit when an OOM occurs due to memcg. Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki, Daisuke Nishimura and KOSAKI Motohiro for review. Sample output ------------- Task in /a/x killed as a result of limit of /a memory: usage 1048576kB, limit 1048576kB, failcnt 4183 memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0 [akpm@linux-foundation.org: compilation fix] [akpm@linux-foundation.org: fix kerneldoc and whitespace] [akpm@linux-foundation.org: add printk facility level] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e222432bfa7dcf6ec008622a978c9f284ed5e3a9;"Sample output Task in /a/x killed as a result of limit of /a memory: usage 1048576kB, limit 1048576kB, failcnt 4183 memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0";yes;no;yes 315;MDY6Q29tbWl0MjMyNTI5ODphMTI4ODhmNzcyZGFiNGJmNWU2ZjczNjY4ZGM0ZjVmNjAyNmE3MDE0;Cyrill Gorcunov;Linus Torvalds;"oom_kill: don't call for int_sqrt(0) There is no need to call for int_sqrt if argument is 0. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a12888f772dab4bf5e6f73668dc4f5f6026a7014;oom_kill: don't call for int_sqrt(0);yes;no;no 315;MDY6Q29tbWl0MjMyNTI5ODphMTI4ODhmNzcyZGFiNGJmNWU2ZjczNjY4ZGM0ZjVmNjAyNmE3MDE0;Cyrill Gorcunov;Linus Torvalds;"oom_kill: don't call for int_sqrt(0) There is no need to call for int_sqrt if argument is 0. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a12888f772dab4bf5e6f73668dc4f5f6026a7014;There is no need to call for int_sqrt if argument is 0.;yes;yes;no 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;memcg: avoid deadlock caused by race between oom and cpuset_attach;yes;yes;no 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;"mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem)";no;no;yes 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;" This means down_write(mm->mmap_sem) can be called under cgroup_mutex";no;no;yes 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;"OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory()";no;no;yes 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;" And mem_cgroup_out_of_memory() calls cgroup_lock()";no;no;yes 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;" This means cgroup_lock() can be called under down_read(mm->mmap_sem)";no;no;yes 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;If those two paths race, deadlock can happen;no;yes;yes 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;This patch avoid this deadlock by;yes;yes;no 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540; - remove cgroup_lock() from mem_cgroup_out_of_memory();yes;no;no 316;MDY6Q29tbWl0MjMyNTI5ODo3ZjRkNDU0ZGVlMmUwYmRkMjFiYWZkNDEzZDFjNTNlNDQzYTI2NTQw;Daisuke Nishimura;Linus Torvalds;"memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7f4d454dee2e0bdd21bafd413d1c53e443a26540;" - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory.";yes;no;no 317;MDY6Q29tbWl0MjMyNTI5ODphNjM2YjMyN2Y3MzExNDNjY2M1NDRiOTY2Y2ZkOGRlNmNiNmQ3MmM2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE"". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a636b327f731143ccc544b966cfd8de6cb6d72c6;memcg: avoid unnecessary system-wide-oom-killer;yes;yes;no 317;MDY6Q29tbWl0MjMyNTI5ODphNjM2YjMyN2Y3MzExNDNjY2M1NDRiOTY2Y2ZkOGRlNmNiNmQ3MmM2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE"". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a636b327f731143ccc544b966cfd8de6cb6d72c6;Current mmtom has new oom function as pagefault_out_of_memory();no;no;yes 317;MDY6Q29tbWl0MjMyNTI5ODphNjM2YjMyN2Y3MzExNDNjY2M1NDRiOTY2Y2ZkOGRlNmNiNmQ3MmM2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE"". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a636b327f731143ccc544b966cfd8de6cb6d72c6;" It's added for select bad process rathar than killing current";no;no;yes 317;MDY6Q29tbWl0MjMyNTI5ODphNjM2YjMyN2Y3MzExNDNjY2M1NDRiOTY2Y2ZkOGRlNmNiNmQ3MmM2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE"". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a636b327f731143ccc544b966cfd8de6cb6d72c6;"When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens";no;no;yes 317;MDY6Q29tbWl0MjMyNTI5ODphNjM2YjMyN2Y3MzExNDNjY2M1NDRiOTY2Y2ZkOGRlNmNiNmQ3MmM2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE"". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a636b327f731143ccc544b966cfd8de6cb6d72c6;" (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom";yes;yes;yes 317;MDY6Q29tbWl0MjMyNTI5ODphNjM2YjMyN2Y3MzExNDNjY2M1NDRiOTY2Y2ZkOGRlNmNiNmQ3MmM2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE"". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a636b327f731143ccc544b966cfd8de6cb6d72c6;"And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE""";yes;yes;no 317;MDY6Q29tbWl0MjMyNTI5ODphNjM2YjMyN2Y3MzExNDNjY2M1NDRiOTY2Y2ZkOGRlNmNiNmQ3MmM2;KAMEZAWA Hiroyuki;Linus Torvalds;"memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee ""don't accnout against process with TIF_MEMDIE"". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a636b327f731143ccc544b966cfd8de6cb6d72c6; This is necessary for smooth OOM.;no;yes;no 318;MDY6Q29tbWl0MjMyNTI5ODo3NWFhMTk5NDEwMzU5ZGM1ZmJjZjkwMjVmZjdhZjk4YTlkMjBmMGQ1;David Rientjes;Linus Torvalds;"oom: print triggering task's cpuset and mems allowed When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly. We also print the task's cpuset name for informational purposes. [rientjes@google.com: task lock current before dereferencing cpuset] Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/75aa199410359dc5fbcf9025ff7af98a9d20f0d5;oom: print triggering task's cpuset and mems allowed;yes;no;no 318;MDY6Q29tbWl0MjMyNTI5ODo3NWFhMTk5NDEwMzU5ZGM1ZmJjZjkwMjVmZjdhZjk4YTlkMjBmMGQ1;David Rientjes;Linus Torvalds;"oom: print triggering task's cpuset and mems allowed When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly. We also print the task's cpuset name for informational purposes. [rientjes@google.com: task lock current before dereferencing cpuset] Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/75aa199410359dc5fbcf9025ff7af98a9d20f0d5;"When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly";yes;yes;yes 318;MDY6Q29tbWl0MjMyNTI5ODo3NWFhMTk5NDEwMzU5ZGM1ZmJjZjkwMjVmZjdhZjk4YTlkMjBmMGQ1;David Rientjes;Linus Torvalds;"oom: print triggering task's cpuset and mems allowed When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly. We also print the task's cpuset name for informational purposes. [rientjes@google.com: task lock current before dereferencing cpuset] Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/75aa199410359dc5fbcf9025ff7af98a9d20f0d5;We also print the task's cpuset name for informational purposes.;yes;yes;no 319;MDY6Q29tbWl0MjMyNTI5ODpjN2Q0Y2FlYjFkNjhkMDdmNzdjYzA5ZmMyMGI3NzU5ZDZkN2FhM2Ix;David Rientjes;Linus Torvalds;"oom: fix zone_scan_mutex name zone_scan_mutex is actually a spinlock, so name it appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7d4caeb1d68d07f77cc09fc20b7759d6d7aa3b1;oom: fix zone_scan_mutex name;yes;yes;no 319;MDY6Q29tbWl0MjMyNTI5ODpjN2Q0Y2FlYjFkNjhkMDdmNzdjYzA5ZmMyMGI3NzU5ZDZkN2FhM2Ix;David Rientjes;Linus Torvalds;"oom: fix zone_scan_mutex name zone_scan_mutex is actually a spinlock, so name it appropriately. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7d4caeb1d68d07f77cc09fc20b7759d6d7aa3b1;zone_scan_mutex is actually a spinlock, so name it appropriately.;yes;yes;yes 320;MDY6Q29tbWl0MjMyNTI5ODoxYzBmZTZlM2JkYTA0NjQ3MjhjMjNjOGQ4NGFhNDc1NjdlOGI3MTZj;Nick Piggin;Linus Torvalds;"mm: invoke oom-killer from page fault Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. [akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1c0fe6e3bda0464728c23c8d84aa47567e8b716c;mm: invoke oom-killer from page fault;yes;no;no 320;MDY6Q29tbWl0MjMyNTI5ODoxYzBmZTZlM2JkYTA0NjQ3MjhjMjNjOGQ4NGFhNDc1NjdlOGI3MTZj;Nick Piggin;Linus Torvalds;"mm: invoke oom-killer from page fault Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. [akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1c0fe6e3bda0464728c23c8d84aa47567e8b716c;"Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer";yes;no;no 320;MDY6Q29tbWl0MjMyNTI5ODoxYzBmZTZlM2JkYTA0NjQ3MjhjMjNjOGQ4NGFhNDc1NjdlOGI3MTZj;Nick Piggin;Linus Torvalds;"mm: invoke oom-killer from page fault Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. [akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1c0fe6e3bda0464728c23c8d84aa47567e8b716c;"With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time";yes;yes;yes 320;MDY6Q29tbWl0MjMyNTI5ODoxYzBmZTZlM2JkYTA0NjQ3MjhjMjNjOGQ4NGFhNDc1NjdlOGI3MTZj;Nick Piggin;Linus Torvalds;"mm: invoke oom-killer from page fault Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. [akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1c0fe6e3bda0464728c23c8d84aa47567e8b716c;" Create a hook for pagefault oom path to call into instead";yes;no;no 320;MDY6Q29tbWl0MjMyNTI5ODoxYzBmZTZlM2JkYTA0NjQ3MjhjMjNjOGQ4NGFhNDc1NjdlOGI3MTZj;Nick Piggin;Linus Torvalds;"mm: invoke oom-killer from page fault Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. [akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1c0fe6e3bda0464728c23c8d84aa47567e8b716c;Only converted x86 and uml so far.;yes;no;no 321;MDY6Q29tbWl0MjMyNTI5ODpjNjllOGQ5YzAxZGIyYWRjNTAzNDY0OTkzYzM1ODkwMWM5YWY5ZGU0;David Howells;James Morris;"CRED: Use RCU to access another task's creds and to release a task's own creds Use RCU to access another task's creds and to release a task's own creds. This means that it will be possible for the credentials of a task to be replaced without another task (a) requiring a full lock to read them, and (b) seeing deallocated memory. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/c69e8d9c01db2adc503464993c358901c9af9de4;CRED: Use RCU to access another task's creds and to release a task's own creds;yes;no;no 321;MDY6Q29tbWl0MjMyNTI5ODpjNjllOGQ5YzAxZGIyYWRjNTAzNDY0OTkzYzM1ODkwMWM5YWY5ZGU0;David Howells;James Morris;"CRED: Use RCU to access another task's creds and to release a task's own creds Use RCU to access another task's creds and to release a task's own creds. This means that it will be possible for the credentials of a task to be replaced without another task (a) requiring a full lock to read them, and (b) seeing deallocated memory. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/c69e8d9c01db2adc503464993c358901c9af9de4;Use RCU to access another task's creds and to release a task's own creds;yes;no;no 321;MDY6Q29tbWl0MjMyNTI5ODpjNjllOGQ5YzAxZGIyYWRjNTAzNDY0OTkzYzM1ODkwMWM5YWY5ZGU0;David Howells;James Morris;"CRED: Use RCU to access another task's creds and to release a task's own creds Use RCU to access another task's creds and to release a task's own creds. This means that it will be possible for the credentials of a task to be replaced without another task (a) requiring a full lock to read them, and (b) seeing deallocated memory. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/c69e8d9c01db2adc503464993c358901c9af9de4;"This means that it will be possible for the credentials of a task to be replaced without another task (a) requiring a full lock to read them, and (b) seeing deallocated memory.";yes;yes;yes 322;MDY6Q29tbWl0MjMyNTI5ODpiNmRmZjNlYzVlMTE2ZTNhZjZmNTM3ZDRjYWVkY2FkNmI5ZTUwODJh;David Howells;James Morris;"CRED: Separate task security context from task_struct Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6dff3ec5e116e3af6f537d4caedcad6b9e5082a;CRED: Separate task security context from task_struct;yes;no;no 322;MDY6Q29tbWl0MjMyNTI5ODpiNmRmZjNlYzVlMTE2ZTNhZjZmNTM3ZDRjYWVkY2FkNmI5ZTUwODJh;David Howells;James Morris;"CRED: Separate task security context from task_struct Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6dff3ec5e116e3af6f537d4caedcad6b9e5082a;Separate the task security context from task_struct;yes;no;no 322;MDY6Q29tbWl0MjMyNTI5ODpiNmRmZjNlYzVlMTE2ZTNhZjZmNTM3ZDRjYWVkY2FkNmI5ZTUwODJh;David Howells;James Morris;"CRED: Separate task security context from task_struct Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6dff3ec5e116e3af6f537d4caedcad6b9e5082a;" At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it";yes;yes;yes 322;MDY6Q29tbWl0MjMyNTI5ODpiNmRmZjNlYzVlMTE2ZTNhZjZmNTM3ZDRjYWVkY2FkNmI5ZTUwODJh;David Howells;James Morris;"CRED: Separate task security context from task_struct Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6dff3ec5e116e3af6f537d4caedcad6b9e5082a;"Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets";yes;no;yes 322;MDY6Q29tbWl0MjMyNTI5ODpiNmRmZjNlYzVlMTE2ZTNhZjZmNTM3ZDRjYWVkY2FkNmI5ZTUwODJh;David Howells;James Morris;"CRED: Separate task security context from task_struct Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/b6dff3ec5e116e3af6f537d4caedcad6b9e5082a;With comment fixes ;yes;yes;no 323;MDY6Q29tbWl0MjMyNTI5ODphMmYyOTQ1YTk5MDU3YzdkNDQwNDM0NjU5MDZjNmJiNjNjMzM2OGEw;Eric Paris;James Morris;"The oomkiller calculations make decisions based on capabilities. Since these are not security decisions and LSMs should not record if they fall the request they should use the new has_capability_noaudit() interface so the denials will not be recorded. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2f2945a99057c7d44043465906c6bb63c3368a0;The oomkiller calculations make decisions based on capabilities. Since;no;no;yes 323;MDY6Q29tbWl0MjMyNTI5ODphMmYyOTQ1YTk5MDU3YzdkNDQwNDM0NjU5MDZjNmJiNjNjMzM2OGEw;Eric Paris;James Morris;"The oomkiller calculations make decisions based on capabilities. Since these are not security decisions and LSMs should not record if they fall the request they should use the new has_capability_noaudit() interface so the denials will not be recorded. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/a2f2945a99057c7d44043465906c6bb63c3368a0;"these are not security decisions and LSMs should not record if they fall the request they should use the new has_capability_noaudit() interface so the denials will not be recorded.";yes;yes;yes 324;MDY6Q29tbWl0MjMyNTI5ODpmYmRkMTI2NzZjODNkZjc3NDgwZjAwZWJkMzJmYzk4ZmJlM2JmODM2;Qinghuang Feng;Linus Torvalds;"mm/oom_kill.c: fix badness() kerneldoc Paramter @mem has been removed since v2.6.26, now delete it's comment. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fbdd12676c83df77480f00ebd32fc98fbe3bf836;mm/oom_kill.c: fix badness() kerneldoc;yes;yes;no 324;MDY6Q29tbWl0MjMyNTI5ODpmYmRkMTI2NzZjODNkZjc3NDgwZjAwZWJkMzJmYzk4ZmJlM2JmODM2;Qinghuang Feng;Linus Torvalds;"mm/oom_kill.c: fix badness() kerneldoc Paramter @mem has been removed since v2.6.26, now delete it's comment. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fbdd12676c83df77480f00ebd32fc98fbe3bf836;Paramter @mem has been removed since v2.6.26, now delete it's comment.;yes;yes;yes 325;MDY6Q29tbWl0MjMyNTI5ODpiNDQxNmQyYmVhMDA3ZjA3ZjJlNzRjZGM0Y2I2NDA0MmVjOTk2Yzgz;David Rientjes;Linus Torvalds;"oom: do not dump task state for non thread group leaders When /proc/sys/vm/oom_dump_tasks is enabled, it's only necessary to dump task state information for thread group leaders. The kernel log gets quickly overwhelmed on machines with a massive number of threads by dumping non-thread group leaders. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4416d2bea007f07f2e74cdc4cb64042ec996c83;oom: do not dump task state for non thread group leaders;yes;no;no 325;MDY6Q29tbWl0MjMyNTI5ODpiNDQxNmQyYmVhMDA3ZjA3ZjJlNzRjZGM0Y2I2NDA0MmVjOTk2Yzgz;David Rientjes;Linus Torvalds;"oom: do not dump task state for non thread group leaders When /proc/sys/vm/oom_dump_tasks is enabled, it's only necessary to dump task state information for thread group leaders. The kernel log gets quickly overwhelmed on machines with a massive number of threads by dumping non-thread group leaders. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4416d2bea007f07f2e74cdc4cb64042ec996c83;"When /proc/sys/vm/oom_dump_tasks is enabled, it's only necessary to dump task state information for thread group leaders";yes;yes;yes 325;MDY6Q29tbWl0MjMyNTI5ODpiNDQxNmQyYmVhMDA3ZjA3ZjJlNzRjZGM0Y2I2NDA0MmVjOTk2Yzgz;David Rientjes;Linus Torvalds;"oom: do not dump task state for non thread group leaders When /proc/sys/vm/oom_dump_tasks is enabled, it's only necessary to dump task state information for thread group leaders. The kernel log gets quickly overwhelmed on machines with a massive number of threads by dumping non-thread group leaders. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b4416d2bea007f07f2e74cdc4cb64042ec996c83;" The kernel log gets quickly overwhelmed on machines with a massive number of threads by dumping non-thread group leaders.";no;yes;yes 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;security: Fix setting of PF_SUPERPRIV by __capable();yes;yes;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;"Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time";yes;yes;yes 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;__capable() is using neither atomic ops nor locking to protect t->flags;no;yes;yes 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried";yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;This patch further splits security_ptrace() in two;yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40; (1) security_ptrace_may_access();yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process";yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40; current is the parent;yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40; (2) security_ptrace_traceme();yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process";yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40; current is the child;yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail";yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40; This does not set PF_SUPERPRIV;yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;"Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable()";yes;yes;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;Of the places that were using __capable();yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" (1) The OOM killer calls __capable() thrice when weighing the killability of a process";no;no;yes 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40; All of these now use has_capability();yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process";no;no;yes 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" As mentioned above, these have been split";yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used";yes;no;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40; (3) cap_safe_nice() only ever saw current, so now uses capable();yes;yes;yes 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead";yes;yes;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating";yes;yes;no 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;" (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged";yes;no;yes 326;MDY6Q29tbWl0MjMyNTI5ODo1Y2Q5YzU4ZmJlOWVjOTJiNDViMjdlMTMxNzE5YWY0ZjJiZDllYjQw;David Howells;James Morris;"security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>";https://api.github.com/repos/torvalds/linux/git/commits/5cd9c58fbe9ec92b45b27e131719af4f2bd9eb40;I've tested this with the LTP SELinux and syscalls testscripts.;yes;no;no 327;MDY6Q29tbWl0MjMyNTI5ODo5N2Q4N2M5NzEwYmM2YzVmMjU4NWZiOWRjNThmNWJlZGJlOTk2ZjEw;Li Zefan;Linus Torvalds;"oom_kill: remove unused parameter in badness() In commit 4c4a22148909e4c003562ea7ffe0a06e26919e3c, we moved the memcontroller-related code from badness() to select_bad_process(), so the parameter 'mem' in badness() is unused now. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97d87c9710bc6c5f2585fb9dc58f5bedbe996f10;oom_kill: remove unused parameter in badness();yes;yes;no 327;MDY6Q29tbWl0MjMyNTI5ODo5N2Q4N2M5NzEwYmM2YzVmMjU4NWZiOWRjNThmNWJlZGJlOTk2ZjEw;Li Zefan;Linus Torvalds;"oom_kill: remove unused parameter in badness() In commit 4c4a22148909e4c003562ea7ffe0a06e26919e3c, we moved the memcontroller-related code from badness() to select_bad_process(), so the parameter 'mem' in badness() is unused now. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97d87c9710bc6c5f2585fb9dc58f5bedbe996f10;"In commit 4c4a22148909e4c003562ea7ffe0a06e26919e3c, we moved the memcontroller-related code from badness() to select_bad_process(), so the parameter 'mem' in badness() is unused now.";no;yes;yes 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;mm: have zonelist contains structs with both a zone pointer and zone_idx;yes;no;no 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;Filtering zonelists requires very frequent use of zone_idx();no;no;yes 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;" This is costly as it involves a lookup of another structure and a substraction operation";no;yes;yes 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;" As the zone_idx is often required, it should be quickly accessible";no;yes;yes 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;" The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used";yes;yes;yes 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;"This patch introduces a struct zoneref to store a zone pointer and a zone index";yes;yes;no 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;" The zonelist then consists of an array of these struct zonerefs which are looked up as necessary";yes;no;no 328;MDY6Q29tbWl0MjMyNTI5ODpkZDFhMjM5ZjZmMmQ0ZDNlZWRkMzE4NTgzZWMzMTlhYTE0NWIzMjRj;Mel Gorman;Linus Torvalds;"mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd1a239f6f2d4d3eedd318583ec319aa145b324c;" Helpers are given for accessing the zone index as well as the node index.";yes;yes;no 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;mm: use two zonelist that are filtered by GFP mask;yes;no;no 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;"Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations";no;no;yes 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;" Based on the zones allowed by a gfp mask, one of these zonelists is selected";no;no;yes 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;" All of these zonelists consume memory and occupy cache lines";no;yes;yes 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;This patch replaces the multiple zonelists per-node with two zonelists;yes;no;no 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;" The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages";yes;yes;no 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;" The second contains all populated zones in the node suitable for GFP_THISNODE allocations";yes;no;no 329;MDY6Q29tbWl0MjMyNTI5ODo1NGE2ZWI1YzQ3NjVhYTU3M2EwMzBjZWViYTJjMTRlM2QyZWE1NzA2;Mel Gorman;Linus Torvalds;"mm: use two zonelist that are filtered by GFP mask Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/54a6eb5c4765aa573a030ceeba2c14e3d2ea5706;"An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist.";yes;no;no 330;MDY6Q29tbWl0MjMyNTI5ODplMTE1ZjJkODkyNTM0OTBmYjJkYmYzMDRiNjI3ZjhkOTA4ZGYyNmYx;Li Zefan;Linus Torvalds;"memcg: fix oops in oom handling When I used a test program to fork mass processes and immediately move them to a cgroup where the memory limit is low enough to trigger oom kill, I got oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000808 IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 PGD 4c95f067 PUD 4406c067 PMD 0 Oops: 0002 [1] SMP CPU 2 Modules linked in: Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5 RIP: 0010:[<ffffffff8045c47f>] [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002 RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3 RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808 RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900 R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140 R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200 FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0) Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0 ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0 ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0 Call Trace: [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9 [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2 [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67 [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202 [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202 [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4 [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429 [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364 [<ffffffff80210471>] ? sys_mmap+0xe5/0x111 [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1 Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00 RIP [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP <ffff8100448c7c30> CR2: 0000000000000808 ---[ end trace c3702fa668021ea4 ]--- It's reproducable in a x86_64 box, but doesn't happen in x86_32. This is because tsk->sighand is not guarded by RCU, so we have to hold tasklist_lock, just as what out_of_memory() does. Signed-off-by: Li Zefan <lizf@cn.fujitsu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Rientjes <rientjes@cs.washington.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e115f2d89253490fb2dbf304b627f8d908df26f1;memcg: fix oops in oom handling;yes;yes;no 330;MDY6Q29tbWl0MjMyNTI5ODplMTE1ZjJkODkyNTM0OTBmYjJkYmYzMDRiNjI3ZjhkOTA4ZGYyNmYx;Li Zefan;Linus Torvalds;"memcg: fix oops in oom handling When I used a test program to fork mass processes and immediately move them to a cgroup where the memory limit is low enough to trigger oom kill, I got oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000808 IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 PGD 4c95f067 PUD 4406c067 PMD 0 Oops: 0002 [1] SMP CPU 2 Modules linked in: Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5 RIP: 0010:[<ffffffff8045c47f>] [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002 RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3 RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808 RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900 R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140 R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200 FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0) Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0 ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0 ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0 Call Trace: [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9 [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2 [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67 [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202 [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202 [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4 [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429 [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364 [<ffffffff80210471>] ? sys_mmap+0xe5/0x111 [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1 Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00 RIP [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP <ffff8100448c7c30> CR2: 0000000000000808 ---[ end trace c3702fa668021ea4 ]--- It's reproducable in a x86_64 box, but doesn't happen in x86_32. This is because tsk->sighand is not guarded by RCU, so we have to hold tasklist_lock, just as what out_of_memory() does. Signed-off-by: Li Zefan <lizf@cn.fujitsu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Rientjes <rientjes@cs.washington.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e115f2d89253490fb2dbf304b627f8d908df26f1;"When I used a test program to fork mass processes and immediately move them to a cgroup where the memory limit is low enough to trigger oom kill, I got oops";no;yes;yes 330;MDY6Q29tbWl0MjMyNTI5ODplMTE1ZjJkODkyNTM0OTBmYjJkYmYzMDRiNjI3ZjhkOTA4ZGYyNmYx;Li Zefan;Linus Torvalds;"memcg: fix oops in oom handling When I used a test program to fork mass processes and immediately move them to a cgroup where the memory limit is low enough to trigger oom kill, I got oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000808 IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 PGD 4c95f067 PUD 4406c067 PMD 0 Oops: 0002 [1] SMP CPU 2 Modules linked in: Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5 RIP: 0010:[<ffffffff8045c47f>] [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002 RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3 RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808 RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900 R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140 R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200 FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0) Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0 ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0 ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0 Call Trace: [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9 [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2 [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67 [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202 [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202 [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4 [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429 [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364 [<ffffffff80210471>] ? sys_mmap+0xe5/0x111 [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1 Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00 RIP [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP <ffff8100448c7c30> CR2: 0000000000000808 ---[ end trace c3702fa668021ea4 ]--- It's reproducable in a x86_64 box, but doesn't happen in x86_32. This is because tsk->sighand is not guarded by RCU, so we have to hold tasklist_lock, just as what out_of_memory() does. Signed-off-by: Li Zefan <lizf@cn.fujitsu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Rientjes <rientjes@cs.washington.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e115f2d89253490fb2dbf304b627f8d908df26f1;"This is because tsk->sighand is not guarded by RCU, so we have to hold tasklist_lock, just as what out_of_memory() does.";yes;yes;yes 331;MDY6Q29tbWl0MjMyNTI5ODoxYjU3OGRmMDIyMDdhNjdhMjllOGNlZDRkYjNiMzZkODlkZjUyZmVm;Randy Dunlap;Linus Torvalds;"mm/oom_kill: fix kernel-doc Fix kernel-doc notation in oom_kill.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b578df02207a67a29e8ced4db3b36d89df52fef;mm/oom_kill: fix kernel-doc;yes;yes;no 331;MDY6Q29tbWl0MjMyNTI5ODoxYjU3OGRmMDIyMDdhNjdhMjllOGNlZDRkYjNiMzZkODlkZjUyZmVm;Randy Dunlap;Linus Torvalds;"mm/oom_kill: fix kernel-doc Fix kernel-doc notation in oom_kill.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/1b578df02207a67a29e8ced4db3b36d89df52fef;Fix kernel-doc notation in oom_kill.c.;yes;yes;no 332;MDY6Q29tbWl0MjMyNTI5ODowMGYwYjgyNTllNDg5NzljMzcyMTI5OTVkNzk4ZjNmYmQwMzc0Njkw;Balbir Singh;Linus Torvalds;"Memory controller: rename to Memory Resource Controller Rename Memory Controller to Memory Resource Controller. Reflect the same changes in the CONFIG definition for the Memory Resource Controller. Group together the config options for Resource Counters and Memory Resource Controller. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/00f0b8259e48979c37212995d798f3fbd0374690;Memory controller: rename to Memory Resource Controller;yes;no;no 332;MDY6Q29tbWl0MjMyNTI5ODowMGYwYjgyNTllNDg5NzljMzcyMTI5OTVkNzk4ZjNmYmQwMzc0Njkw;Balbir Singh;Linus Torvalds;"Memory controller: rename to Memory Resource Controller Rename Memory Controller to Memory Resource Controller. Reflect the same changes in the CONFIG definition for the Memory Resource Controller. Group together the config options for Resource Counters and Memory Resource Controller. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/00f0b8259e48979c37212995d798f3fbd0374690;Rename Memory Controller to Memory Resource Controller;yes;no;no 332;MDY6Q29tbWl0MjMyNTI5ODowMGYwYjgyNTllNDg5NzljMzcyMTI5OTVkNzk4ZjNmYmQwMzc0Njkw;Balbir Singh;Linus Torvalds;"Memory controller: rename to Memory Resource Controller Rename Memory Controller to Memory Resource Controller. Reflect the same changes in the CONFIG definition for the Memory Resource Controller. Group together the config options for Resource Counters and Memory Resource Controller. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/00f0b8259e48979c37212995d798f3fbd0374690;" Reflect the same changes in the CONFIG definition for the Memory Resource Controller";yes;yes;no 332;MDY6Q29tbWl0MjMyNTI5ODowMGYwYjgyNTllNDg5NzljMzcyMTI5OTVkNzk4ZjNmYmQwMzc0Njkw;Balbir Singh;Linus Torvalds;"Memory controller: rename to Memory Resource Controller Rename Memory Controller to Memory Resource Controller. Reflect the same changes in the CONFIG definition for the Memory Resource Controller. Group together the config options for Resource Counters and Memory Resource Controller. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/00f0b8259e48979c37212995d798f3fbd0374690;" Group together the config options for Resource Counters and Memory Resource Controller.";yes;no;no 333;MDY6Q29tbWl0MjMyNTI5ODpmZWYxYmRkNjhjODFiNzE4ODJjY2I2ZjQ3YzcwOTgwYTAzMTgyMDYz;David Rientjes;Linus Torvalds;"oom: add sysctl to enable task memory dump Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fef1bdd68c81b71882ccb6f47c70980a03182063;oom: add sysctl to enable task memory dump;yes;yes;no 333;MDY6Q29tbWl0MjMyNTI5ODpmZWYxYmRkNjhjODFiNzE4ODJjY2I2ZjQ3YzcwOTgwYTAzMTgyMDYz;David Rientjes;Linus Torvalds;"oom: add sysctl to enable task memory dump Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fef1bdd68c81b71882ccb6f47c70980a03182063;"Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing";yes;yes;no 333;MDY6Q29tbWl0MjMyNTI5ODpmZWYxYmRkNjhjODFiNzE4ODJjY2I2ZjQ3YzcwOTgwYTAzMTgyMDYz;David Rientjes;Linus Torvalds;"oom: add sysctl to enable task memory dump Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fef1bdd68c81b71882ccb6f47c70980a03182063;" Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name";yes;no;no 333;MDY6Q29tbWl0MjMyNTI5ODpmZWYxYmRkNjhjODFiNzE4ODJjY2I2ZjQ3YzcwOTgwYTAzMTgyMDYz;David Rientjes;Linus Torvalds;"oom: add sysctl to enable task memory dump Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fef1bdd68c81b71882ccb6f47c70980a03182063;"This is helpful for determining why there was an OOM condition and which rogue task caused it";no;yes;no 333;MDY6Q29tbWl0MjMyNTI5ODpmZWYxYmRkNjhjODFiNzE4ODJjY2I2ZjQ3YzcwOTgwYTAzMTgyMDYz;David Rientjes;Linus Torvalds;"oom: add sysctl to enable task memory dump Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fef1bdd68c81b71882ccb6f47c70980a03182063;"It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire";yes;yes;no 333;MDY6Q29tbWl0MjMyNTI5ODpmZWYxYmRkNjhjODFiNzE4ODJjY2I2ZjQ3YzcwOTgwYTAzMTgyMDYz;David Rientjes;Linus Torvalds;"oom: add sysctl to enable task memory dump Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fef1bdd68c81b71882ccb6f47c70980a03182063;"If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup.";yes;yes;no 334;MDY6Q29tbWl0MjMyNTI5ODo0YzRhMjIxNDg5MDllNGMwMDM1NjJlYTdmZmUwYTA2ZTI2OTE5ZTNj;David Rientjes;Linus Torvalds;"memcontrol: move oom task exclusion to tasklist scan Creates a helper function to return non-zero if a task is a member of a memory controller: int task_in_mem_cgroup(const struct task_struct *task, const struct mem_cgroup *mem); When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function. It should be excluded during the tasklist scan in select_bad_process() instead. [akpm@linux-foundation.org: build fix] Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4c4a22148909e4c003562ea7ffe0a06e26919e3c;memcontrol: move oom task exclusion to tasklist scan;yes;no;no 334;MDY6Q29tbWl0MjMyNTI5ODo0YzRhMjIxNDg5MDllNGMwMDM1NjJlYTdmZmUwYTA2ZTI2OTE5ZTNj;David Rientjes;Linus Torvalds;"memcontrol: move oom task exclusion to tasklist scan Creates a helper function to return non-zero if a task is a member of a memory controller: int task_in_mem_cgroup(const struct task_struct *task, const struct mem_cgroup *mem); When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function. It should be excluded during the tasklist scan in select_bad_process() instead. [akpm@linux-foundation.org: build fix] Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4c4a22148909e4c003562ea7ffe0a06e26919e3c;"Creates a helper function to return non-zero if a task is a member of a memory controller";yes;no;no 334;MDY6Q29tbWl0MjMyNTI5ODo0YzRhMjIxNDg5MDllNGMwMDM1NjJlYTdmZmUwYTA2ZTI2OTE5ZTNj;David Rientjes;Linus Torvalds;"memcontrol: move oom task exclusion to tasklist scan Creates a helper function to return non-zero if a task is a member of a memory controller: int task_in_mem_cgroup(const struct task_struct *task, const struct mem_cgroup *mem); When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function. It should be excluded during the tasklist scan in select_bad_process() instead. [akpm@linux-foundation.org: build fix] Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4c4a22148909e4c003562ea7ffe0a06e26919e3c;" int task_in_mem_cgroup(const struct task_struct *task, When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function";no;yes;yes 334;MDY6Q29tbWl0MjMyNTI5ODo0YzRhMjIxNDg5MDllNGMwMDM1NjJlYTdmZmUwYTA2ZTI2OTE5ZTNj;David Rientjes;Linus Torvalds;"memcontrol: move oom task exclusion to tasklist scan Creates a helper function to return non-zero if a task is a member of a memory controller: int task_in_mem_cgroup(const struct task_struct *task, const struct mem_cgroup *mem); When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function. It should be excluded during the tasklist scan in select_bad_process() instead. [akpm@linux-foundation.org: build fix] Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4c4a22148909e4c003562ea7ffe0a06e26919e3c;" It should be excluded during the tasklist scan in select_bad_process() instead.";yes;no;no 335;MDY6Q29tbWl0MjMyNTI5ODpjN2JhNWM5ZTgxNzY3MDRiZmFjMDcyOTg3NWZhNjI3OTgwMzc1ODRk;Pavel Emelianov;Linus Torvalds;"Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7ba5c9e8176704bfac0729875fa62798037584d;Memory controller: OOM handling;yes;no;no 335;MDY6Q29tbWl0MjMyNTI5ODpjN2JhNWM5ZTgxNzY3MDRiZmFjMDcyOTg3NWZhNjI3OTgwMzc1ODRk;Pavel Emelianov;Linus Torvalds;"Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7ba5c9e8176704bfac0729875fa62798037584d;Out of memory handling for cgroups over their limit;yes;no;no 335;MDY6Q29tbWl0MjMyNTI5ODpjN2JhNWM5ZTgxNzY3MDRiZmFjMDcyOTg3NWZhNjI3OTgwMzc1ODRk;Pavel Emelianov;Linus Torvalds;"Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7ba5c9e8176704bfac0729875fa62798037584d;"A task from the cgroup over limit is chosen using the existing OOM logic and killed";yes;no;no 335;MDY6Q29tbWl0MjMyNTI5ODpjN2JhNWM5ZTgxNzY3MDRiZmFjMDcyOTg3NWZhNjI3OTgwMzc1ODRk;Pavel Emelianov;Linus Torvalds;"Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7ba5c9e8176704bfac0729875fa62798037584d;TODO;no;no;no 335;MDY6Q29tbWl0MjMyNTI5ODpjN2JhNWM5ZTgxNzY3MDRiZmFjMDcyOTg3NWZhNjI3OTgwMzc1ODRk;Pavel Emelianov;Linus Torvalds;"Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7ba5c9e8176704bfac0729875fa62798037584d;1;no;no;no 335;MDY6Q29tbWl0MjMyNTI5ODpjN2JhNWM5ZTgxNzY3MDRiZmFjMDcyOTg3NWZhNjI3OTgwMzc1ODRk;Pavel Emelianov;Linus Torvalds;"Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/c7ba5c9e8176704bfac0729875fa62798037584d;"As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling.";yes;no;no 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;oom_kill: remove uid==0 checks;yes;no;no 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;"Root processes are considered more important when out of memory and killing proceses";no;no;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;" The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0";no;no;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;There are several possible ways to look at this;no;no;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;"uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone";yes;yes;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;" However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well";yes;yes;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;"Any privileged code should be protected, but uid is not an indication of privilege";yes;yes;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;" So we should check whether any capabilities are raised";yes;no;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;"uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks";yes;yes;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;"uid==0 makes processes only on the host more important, even without any capabilities";yes;yes;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;" So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns";yes;no;yes 336;MDY6Q29tbWl0MjMyNTI5ODo5NzgyOTk1NWFkMjkxYWNlYzFkOGI5NGU5OTExYjNjZWIxMTE4YmIx;Serge E. Hallyn;Linus Torvalds;"oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means ""give me extra resources"" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0||euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/97829955ad291acec1d8b94e9911b3ceb1118bb1;I'm following number 1 here.;yes;no;no 337;MDY6Q29tbWl0MjMyNTI5ODplMzM4ZDI2M2E3NmFmNzhmZThmMzhhNzIxMzExODhiNThmY2ViNTkx;Andrew Morgan;Linus Torvalds;"Add 64-bit capability support to the kernel The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e338d263a76af78fe8f38a72131188b58fceb591;Add 64-bit capability support to the kernel;yes;no;no 337;MDY6Q29tbWl0MjMyNTI5ODplMzM4ZDI2M2E3NmFmNzhmZThmMzhhNzIxMzExODhiNThmY2ViNTkx;Andrew Morgan;Linus Torvalds;"Add 64-bit capability support to the kernel The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e338d263a76af78fe8f38a72131188b58fceb591;"The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities";yes;yes;no 337;MDY6Q29tbWl0MjMyNTI5ODplMzM4ZDI2M2E3NmFmNzhmZThmMzhhNzIxMzExODhiNThmY2ViNTkx;Andrew Morgan;Linus Torvalds;"Add 64-bit capability support to the kernel The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e338d263a76af78fe8f38a72131188b58fceb591;" If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE";yes;no;no 337;MDY6Q29tbWl0MjMyNTI5ODplMzM4ZDI2M2E3NmFmNzhmZThmMzhhNzIxMzExODhiNThmY2ViNTkx;Andrew Morgan;Linus Torvalds;"Add 64-bit capability support to the kernel The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e338d263a76af78fe8f38a72131188b58fceb591;FWIW libcap-2.00 supports this change (and earlier capability formats);yes;no;yes 338;MDY6Q29tbWl0MjMyNTI5ODpmYTcxNzA2MGYxYWI3ZWI2NTcwZjJmYjQ5MTM2ZjgzOGZjOTE5NWE5;Peter Zijlstra;Ingo Molnar;"sched: sched_rt_entity Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>";https://api.github.com/repos/torvalds/linux/git/commits/fa717060f1ab7eb6570f2fb49136f838fc9195a9;sched: sched_rt_entity;no;no;no 338;MDY6Q29tbWl0MjMyNTI5ODpmYTcxNzA2MGYxYWI3ZWI2NTcwZjJmYjQ5MTM2ZjgzOGZjOTE5NWE5;Peter Zijlstra;Ingo Molnar;"sched: sched_rt_entity Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>";https://api.github.com/repos/torvalds/linux/git/commits/fa717060f1ab7eb6570f2fb49136f838fc9195a9;Move the task_struct members specific to rt scheduling together;yes;yes;no 338;MDY6Q29tbWl0MjMyNTI5ODpmYTcxNzA2MGYxYWI3ZWI2NTcwZjJmYjQ5MTM2ZjgzOGZjOTE5NWE5;Peter Zijlstra;Ingo Molnar;"sched: sched_rt_entity Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>";https://api.github.com/repos/torvalds/linux/git/commits/fa717060f1ab7eb6570f2fb49136f838fc9195a9;"A future optimization could be to put sched_entity and sched_rt_entity into a union.";yes;no;no 339;MDY6Q29tbWl0MjMyNTI5ODplOTFhODEwZTg4NDg1MDc4MWExY2FkYTJlYTgxYjgwMTY4ODFkMjQ0;Al Viro;Linus Torvalds;"oom_kill bug Wrong order of arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e91a810e884850781a1cada2ea81b8016881d244;oom_kill bug;no;yes;yes 339;MDY6Q29tbWl0MjMyNTI5ODplOTFhODEwZTg4NDg1MDc4MWExY2FkYTJlYTgxYjgwMTY4ODFkMjQ0;Al Viro;Linus Torvalds;"oom_kill bug Wrong order of arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/e91a810e884850781a1cada2ea81b8016881d244;Wrong order of arguments;no;yes;no 340;MDY6Q29tbWl0MjMyNTI5ODpiYTI1ZjlkY2M0ZWE2ZTMwODM5ZmNhYjVhNTUxNmYyMTc2ZDViZmVk;Pavel Emelyanov;Linus Torvalds;"Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ba25f9dcc4ea6e30839fcab5a5516f2176d5bfed;Use helpers to obtain task pid in printks;yes;yes;no 340;MDY6Q29tbWl0MjMyNTI5ODpiYTI1ZjlkY2M0ZWE2ZTMwODM5ZmNhYjVhNTUxNmYyMTc2ZDViZmVk;Pavel Emelyanov;Linus Torvalds;"Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ba25f9dcc4ea6e30839fcab5a5516f2176d5bfed;"The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel";yes;yes;yes 340;MDY6Q29tbWl0MjMyNTI5ODpiYTI1ZjlkY2M0ZWE2ZTMwODM5ZmNhYjVhNTUxNmYyMTc2ZDViZmVk;Pavel Emelyanov;Linus Torvalds;"Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ba25f9dcc4ea6e30839fcab5a5516f2176d5bfed;"The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr()";yes;yes;no 340;MDY6Q29tbWl0MjMyNTI5ODpiYTI1ZjlkY2M0ZWE2ZTMwODM5ZmNhYjVhNTUxNmYyMTc2ZDViZmVk;Pavel Emelyanov;Linus Torvalds;"Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ba25f9dcc4ea6e30839fcab5a5516f2176d5bfed;"Besides, printks produce more (much more) than a half of all the explicit pid usage.";yes;yes;yes 341;MDY6Q29tbWl0MjMyNTI5ODpiYWMwYWJkNjE3NGU0Mjc0MDRkZDE5N2NkYmVmZWNlMzFlOTczMjli;Pavel Emelyanov;Linus Torvalds;"Isolate some explicit usage of task->tgid With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bac0abd6174e427404dd197cdbefece31e97329b;Isolate some explicit usage of task->tgid;yes;no;no 341;MDY6Q29tbWl0MjMyNTI5ODpiYWMwYWJkNjE3NGU0Mjc0MDRkZDE5N2NkYmVmZWNlMzFlOTczMjli;Pavel Emelyanov;Linus Torvalds;"Isolate some explicit usage of task->tgid With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bac0abd6174e427404dd197cdbefece31e97329b;"With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers";yes;yes;yes 341;MDY6Q29tbWl0MjMyNTI5ODpiYWMwYWJkNjE3NGU0Mjc0MDRkZDE5N2NkYmVmZWNlMzFlOTczMjli;Pavel Emelyanov;Linus Torvalds;"Isolate some explicit usage of task->tgid With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bac0abd6174e427404dd197cdbefece31e97329b;"Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated";yes;no;yes 341;MDY6Q29tbWl0MjMyNTI5ODpiYWMwYWJkNjE3NGU0Mjc0MDRkZDE5N2NkYmVmZWNlMzFlOTczMjli;Pavel Emelyanov;Linus Torvalds;"Isolate some explicit usage of task->tgid With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bac0abd6174e427404dd197cdbefece31e97329b;" Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later";yes;yes;yes 341;MDY6Q29tbWl0MjMyNTI5ODpiYWMwYWJkNjE3NGU0Mjc0MDRkZDE5N2NkYmVmZWNlMzFlOTczMjli;Pavel Emelyanov;Linus Torvalds;"Isolate some explicit usage of task->tgid With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: ""Eric W. Biederman"" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bac0abd6174e427404dd197cdbefece31e97329b;"Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable.";yes;yes;yes 342;MDY6Q29tbWl0MjMyNTI5ODo3YjE5MTVhOTg5ZWE0ZDQyNmQwZmQ5ODk3NGFiODBmMzBlZjFkNzc5;Matthias Kaehlcke;Linus Torvalds;"mm/oom_kill.c: Use list_for_each_entry instead of list_for_each mm/oom_kill.c: Convert list_for_each to list_for_each_entry in oom_kill_process() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b1915a989ea4d426d0fd98974ab80f30ef1d779;mm/oom_kill.c: Use list_for_each_entry instead of list_for_each;yes;no;no 342;MDY6Q29tbWl0MjMyNTI5ODo3YjE5MTVhOTg5ZWE0ZDQyNmQwZmQ5ODk3NGFiODBmMzBlZjFkNzc5;Matthias Kaehlcke;Linus Torvalds;"mm/oom_kill.c: Use list_for_each_entry instead of list_for_each mm/oom_kill.c: Convert list_for_each to list_for_each_entry in oom_kill_process() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7b1915a989ea4d426d0fd98974ab80f30ef1d779;"mm/oom_kill.c: Convert list_for_each to list_for_each_entry in oom_kill_process()";yes;no;no 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;pid namespaces: define is_global_init() and is_container_init();yes;no;no 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;is_init() is an ambiguous name for the pid==1 check;no;yes;yes 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;" Split it into is_global_init() and is_container_init()";yes;no;no 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;A cgroup init has it's tsk->pid == 1;yes;no;yes 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;"A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns";yes;no;yes 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;" But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes";yes;no;yes 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;Changelog;no;no;no 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;" - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process";yes;no;no 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;"This would improve performance and remove dependence on the task_pid()";no;yes;no 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43; ppc,avr32}/traps.c for the _exception() call to is_global_init();yes;no;yes 343;MDY6Q29tbWl0MjMyNTI5ODpiNDYwY2JjNTgxYTUzY2MwODhjZWJhODA2MDgwMjFkZDQ5YzYzYzQz;Serge E. Hallyn;Linus Torvalds;"pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/b460cbc581a53cc088ceba80608021dd49c63c43;" This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic.";yes;yes;no 344;MDY6Q29tbWl0MjMyNTI5ODphZTc0MTM4ZGE2MDljNTc2YjIyMWM3NjVlZmE4YjgxYjIzNjVmNDY1;David Rientjes;Linus Torvalds;"oom: convert zone_scan_lock from mutex to spinlock There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if the lock can't be acquired; it will be available soon enough once the zonelist scanning is done. All other threads waiting for the OOM killer are also contingent on the exiting task being able to acquire the lock in clear_zonelist_oom() so it doesn't make sense to put it to sleep. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ae74138da609c576b221c765efa8b81b2365f465;oom: convert zone_scan_lock from mutex to spinlock;yes;no;no 344;MDY6Q29tbWl0MjMyNTI5ODphZTc0MTM4ZGE2MDljNTc2YjIyMWM3NjVlZmE4YjgxYjIzNjVmNDY1;David Rientjes;Linus Torvalds;"oom: convert zone_scan_lock from mutex to spinlock There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if the lock can't be acquired; it will be available soon enough once the zonelist scanning is done. All other threads waiting for the OOM killer are also contingent on the exiting task being able to acquire the lock in clear_zonelist_oom() so it doesn't make sense to put it to sleep. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ae74138da609c576b221c765efa8b81b2365f465;"There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if the lock can't be acquired; it will be available soon enough once the zonelist scanning is done";yes;yes;yes 344;MDY6Q29tbWl0MjMyNTI5ODphZTc0MTM4ZGE2MDljNTc2YjIyMWM3NjVlZmE4YjgxYjIzNjVmNDY1;David Rientjes;Linus Torvalds;"oom: convert zone_scan_lock from mutex to spinlock There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if the lock can't be acquired; it will be available soon enough once the zonelist scanning is done. All other threads waiting for the OOM killer are also contingent on the exiting task being able to acquire the lock in clear_zonelist_oom() so it doesn't make sense to put it to sleep. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ae74138da609c576b221c765efa8b81b2365f465;" All other threads waiting for the OOM killer are also contingent on the exiting task being able to acquire the lock in clear_zonelist_oom() so it doesn't make sense to put it to sleep.";no;yes;yes 345;MDY6Q29tbWl0MjMyNTI5ODozZmY1NjY5NjNjZTgwNDgwOWFmOWUzMjMzMWIyODdlZWRlZWZmNTAx;David Rientjes;Linus Torvalds;"oom: do not take callback_mutex Since no task descriptor's 'cpuset' field is dereferenced in the execution of the OOM killer anymore, it is no longer necessary to take callback_mutex. [akpm@linux-foundation.org: restore cpuset_lock for other patches] Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3ff566963ce804809af9e32331b287eedeeff501;oom: do not take callback_mutex;yes;no;no 345;MDY6Q29tbWl0MjMyNTI5ODozZmY1NjY5NjNjZTgwNDgwOWFmOWUzMjMzMWIyODdlZWRlZWZmNTAx;David Rientjes;Linus Torvalds;"oom: do not take callback_mutex Since no task descriptor's 'cpuset' field is dereferenced in the execution of the OOM killer anymore, it is no longer necessary to take callback_mutex. [akpm@linux-foundation.org: restore cpuset_lock for other patches] Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3ff566963ce804809af9e32331b287eedeeff501;"Since no task descriptor's 'cpuset' field is dereferenced in the execution of the OOM killer anymore, it is no longer necessary to take callback_mutex.";no;yes;yes 346;MDY6Q29tbWl0MjMyNTI5ODpiYmUzNzNmMmM2MGIyYWEzNmMzMjMxNzM0YTVhZmM1MjcxYTA2NzE4;David Rientjes;Linus Torvalds;"oom: compare cpuset mems_allowed instead of exclusive ancestors Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbe373f2c60b2aa36c3231734a5afc5271a06718;oom: compare cpuset mems_allowed instead of exclusive ancestors;yes;no;no 346;MDY6Q29tbWl0MjMyNTI5ODpiYmUzNzNmMmM2MGIyYWEzNmMzMjMxNzM0YTVhZmM1MjcxYTA2NzE4;David Rientjes;Linus Torvalds;"oom: compare cpuset mems_allowed instead of exclusive ancestors Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbe373f2c60b2aa36c3231734a5afc5271a06718;"Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors";yes;yes;no 346;MDY6Q29tbWl0MjMyNTI5ODpiYmUzNzNmMmM2MGIyYWEzNmMzMjMxNzM0YTVhZmM1MjcxYTA2NzE4;David Rientjes;Linus Torvalds;"oom: compare cpuset mems_allowed instead of exclusive ancestors Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbe373f2c60b2aa36c3231734a5afc5271a06718;" This does not require taking callback_mutex since it is only used as a hint in the badness scoring";yes;yes;yes 346;MDY6Q29tbWl0MjMyNTI5ODpiYmUzNzNmMmM2MGIyYWEzNmMzMjMxNzM0YTVhZmM1MjcxYTA2NzE4;David Rientjes;Linus Torvalds;"oom: compare cpuset mems_allowed instead of exclusive ancestors Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/bbe373f2c60b2aa36c3231734a5afc5271a06718;"Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed.";yes;yes;yes 347;MDY6Q29tbWl0MjMyNTI5ODo3MjEzZjUwNjZmYzhhMTdjNzgzODlmZTI0NWRlNTIyYjVjZjA2NDhh;David Rientjes;Linus Torvalds;"oom: suppress extraneous stack and memory dump Suppresses the extraneous stack and memory dump when a parallel OOM killing has been found. There's no need to fill the ring buffer with this information if its already been printed and the condition that triggered the previous OOM killer has not yet been alleviated. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7213f5066fc8a17c78389fe245de522b5cf0648a;oom: suppress extraneous stack and memory dump;yes;yes;no 347;MDY6Q29tbWl0MjMyNTI5ODo3MjEzZjUwNjZmYzhhMTdjNzgzODlmZTI0NWRlNTIyYjVjZjA2NDhh;David Rientjes;Linus Torvalds;"oom: suppress extraneous stack and memory dump Suppresses the extraneous stack and memory dump when a parallel OOM killing has been found. There's no need to fill the ring buffer with this information if its already been printed and the condition that triggered the previous OOM killer has not yet been alleviated. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7213f5066fc8a17c78389fe245de522b5cf0648a;"Suppresses the extraneous stack and memory dump when a parallel OOM killing has been found";yes;yes;yes 347;MDY6Q29tbWl0MjMyNTI5ODo3MjEzZjUwNjZmYzhhMTdjNzgzODlmZTI0NWRlNTIyYjVjZjA2NDhh;David Rientjes;Linus Torvalds;"oom: suppress extraneous stack and memory dump Suppresses the extraneous stack and memory dump when a parallel OOM killing has been found. There's no need to fill the ring buffer with this information if its already been printed and the condition that triggered the previous OOM killer has not yet been alleviated. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/7213f5066fc8a17c78389fe245de522b5cf0648a;" There's no need to fill the ring buffer with this information if its already been printed and the condition that triggered the previous OOM killer has not yet been alleviated.";yes;yes;yes 348;MDY6Q29tbWl0MjMyNTI5ODpmZTA3MWQ3ZThhYWU1NzQ1YzAwOWM4MDhiYjg5MzNmMjJhOWUzMDVh;David Rientjes;Linus Torvalds;"oom: add oom_kill_allocating_task sysctl Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target. This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fe071d7e8aae5745c009c808bb8933f22a9e305a;oom: add oom_kill_allocating_task sysctl;yes;no;no 348;MDY6Q29tbWl0MjMyNTI5ODpmZTA3MWQ3ZThhYWU1NzQ1YzAwOWM4MDhiYjg5MzNmMjJhOWUzMDVh;David Rientjes;Linus Torvalds;"oom: add oom_kill_allocating_task sysctl Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target. This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fe071d7e8aae5745c009c808bb8933f22a9e305a;"Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target";yes;yes;yes 348;MDY6Q29tbWl0MjMyNTI5ODpmZTA3MWQ3ZThhYWU1NzQ1YzAwOWM4MDhiYjg5MzNmMjJhOWUzMDVh;David Rientjes;Linus Torvalds;"oom: add oom_kill_allocating_task sysctl Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target. This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/fe071d7e8aae5745c009c808bb8933f22a9e305a;" This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance.";no;yes;yes 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60;oom: add per-zone locking;yes;no;no 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60;"OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system";yes;yes;no 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60;" DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally";yes;yes;yes 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60;"Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails";yes;yes;no 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60; Otherwise, the OOM killer is invoked;yes;no;no 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60;"Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag";yes;no;no 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60;" If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero";yes;no;no 349;MDY6Q29tbWl0MjMyNTI5ODowOThkN2YxMjhhNGU1M2NiNjQ5MzA2Mjg5MTVhYzc2Nzc4NWUwZTYw;David Rientjes;Linus Torvalds;"oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of ""trylocks."" The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the ""trylock"" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/098d7f128a4e53cb64930628915ac767785e0e60;" Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero.";yes;no;no 350;MDY6Q29tbWl0MjMyNTI5ODo3MGUyNGJkZjZkMmZlYWQxNDYzMWU3MmEwN2ZiYTAxMjQwMGM1MjFl;David Rientjes;Linus Torvalds;"oom: move constraints to enum The OOM killer's CONSTRAINT definitions are really more appropriate in an enum, so define them in include/linux/oom.h. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70e24bdf6d2fead14631e72a07fba012400c521e;oom: move constraints to enum;yes;no;no 350;MDY6Q29tbWl0MjMyNTI5ODo3MGUyNGJkZjZkMmZlYWQxNDYzMWU3MmEwN2ZiYTAxMjQwMGM1MjFl;David Rientjes;Linus Torvalds;"oom: move constraints to enum The OOM killer's CONSTRAINT definitions are really more appropriate in an enum, so define them in include/linux/oom.h. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/70e24bdf6d2fead14631e72a07fba012400c521e;"The OOM killer's CONSTRAINT definitions are really more appropriate in an enum, so define them in include/linux/oom.h.";yes;yes;no 351;MDY6Q29tbWl0MjMyNTI5ODplZTMxYWY1ZDY0OWQ4YWE2YWM3OTQ4YTZkOTdhZTQ4MzY3ZmYyZDdl;Christoph Lameter;Linus Torvalds;"Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly constrained_alloc() builds its own memory map for nodes with memory. We have that available in N_HIGH_MEMORY now. So simplify the code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ee31af5d649d8aa6ac7948a6d97ae48367ff2d7e;Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly;yes;no;no 351;MDY6Q29tbWl0MjMyNTI5ODplZTMxYWY1ZDY0OWQ4YWE2YWM3OTQ4YTZkOTdhZTQ4MzY3ZmYyZDdl;Christoph Lameter;Linus Torvalds;"Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly constrained_alloc() builds its own memory map for nodes with memory. We have that available in N_HIGH_MEMORY now. So simplify the code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ee31af5d649d8aa6ac7948a6d97ae48367ff2d7e;constrained_alloc() builds its own memory map for nodes with memory;no;no;yes 351;MDY6Q29tbWl0MjMyNTI5ODplZTMxYWY1ZDY0OWQ4YWE2YWM3OTQ4YTZkOTdhZTQ4MzY3ZmYyZDdl;Christoph Lameter;Linus Torvalds;"Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly constrained_alloc() builds its own memory map for nodes with memory. We have that available in N_HIGH_MEMORY now. So simplify the code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ee31af5d649d8aa6ac7948a6d97ae48367ff2d7e;" We have that available in N_HIGH_MEMORY now";no;yes;yes 351;MDY6Q29tbWl0MjMyNTI5ODplZTMxYWY1ZDY0OWQ4YWE2YWM3OTQ4YTZkOTdhZTQ4MzY3ZmYyZDdl;Christoph Lameter;Linus Torvalds;"Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly constrained_alloc() builds its own memory map for nodes with memory. We have that available in N_HIGH_MEMORY now. So simplify the code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/ee31af5d649d8aa6ac7948a6d97ae48367ff2d7e; So simplify the code.;yes;yes;no 352;MDY6Q29tbWl0MjMyNTI5ODphNWU1OGE2MTQyMGU5OWRkMDg2ODVmNjIyZDRkYzY2NmJmMDdlOWE1;David Rientjes;Linus Torvalds;"oom: print points as unsigned long In badness(), the automatic variable 'points' is unsigned long. Print it as such. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a5e58a61420e99dd08685f622d4dc666bf07e9a5;oom: print points as unsigned long;yes;no;yes 352;MDY6Q29tbWl0MjMyNTI5ODphNWU1OGE2MTQyMGU5OWRkMDg2ODVmNjIyZDRkYzY2NmJmMDdlOWE1;David Rientjes;Linus Torvalds;"oom: print points as unsigned long In badness(), the automatic variable 'points' is unsigned long. Print it as such. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a5e58a61420e99dd08685f622d4dc666bf07e9a5;In badness(), the automatic variable 'points' is unsigned long;no;yes;yes 352;MDY6Q29tbWl0MjMyNTI5ODphNWU1OGE2MTQyMGU5OWRkMDg2ODVmNjIyZDRkYzY2NmJmMDdlOWE1;David Rientjes;Linus Torvalds;"oom: print points as unsigned long In badness(), the automatic variable 'points' is unsigned long. Print it as such. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/a5e58a61420e99dd08685f622d4dc666bf07e9a5;" Print it as such.";yes;yes;no 353;MDY6Q29tbWl0MjMyNTI5ODo0ZTk1MGY2ZjAxODlmNjVmOGJmMDY5Y2YyMjcyNjQ5ZWY0MThmNWU0;Alexey Dobriyan;Linus Torvalds;"Remove fs.h from mm.h Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4e950f6f0189f65f8bf069cf2272649ef418f5e4;Remove fs.h from mm.h;yes;no;no 353;MDY6Q29tbWl0MjMyNTI5ODo0ZTk1MGY2ZjAxODlmNjVmOGJmMDY5Y2YyMjcyNjQ5ZWY0MThmNWU0;Alexey Dobriyan;Linus Torvalds;"Remove fs.h from mm.h Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4e950f6f0189f65f8bf069cf2272649ef418f5e4;Remove fs.h from mm.h;yes;no;no 353;MDY6Q29tbWl0MjMyNTI5ODo0ZTk1MGY2ZjAxODlmNjVmOGJmMDY5Y2YyMjcyNjQ5ZWY0MThmNWU0;Alexey Dobriyan;Linus Torvalds;"Remove fs.h from mm.h Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4e950f6f0189f65f8bf069cf2272649ef418f5e4;"For this, 1) Uninline vma_wants_writenotify()";yes;no;no 353;MDY6Q29tbWl0MjMyNTI5ODo0ZTk1MGY2ZjAxODlmNjVmOGJmMDY5Y2YyMjcyNjQ5ZWY0MThmNWU0;Alexey Dobriyan;Linus Torvalds;"Remove fs.h from mm.h Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4e950f6f0189f65f8bf069cf2272649ef418f5e4;It's pretty huge anyway;no;yes;yes 353;MDY6Q29tbWl0MjMyNTI5ODo0ZTk1MGY2ZjAxODlmNjVmOGJmMDY5Y2YyMjcyNjQ5ZWY0MThmNWU0;Alexey Dobriyan;Linus Torvalds;"Remove fs.h from mm.h Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4e950f6f0189f65f8bf069cf2272649ef418f5e4; 2) Add back fs.h or less bloated headers (err.h) to files that need it;yes;yes;no 353;MDY6Q29tbWl0MjMyNTI5ODo0ZTk1MGY2ZjAxODlmNjVmOGJmMDY5Y2YyMjcyNjQ5ZWY0MThmNWU0;Alexey Dobriyan;Linus Torvalds;"Remove fs.h from mm.h Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4e950f6f0189f65f8bf069cf2272649ef418f5e4;"As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%)";yes;yes;no 353;MDY6Q29tbWl0MjMyNTI5ODo0ZTk1MGY2ZjAxODlmNjVmOGJmMDY5Y2YyMjcyNjQ5ZWY0MThmNWU0;Alexey Dobriyan;Linus Torvalds;"Remove fs.h from mm.h Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/4e950f6f0189f65f8bf069cf2272649ef418f5e4;Cross-compile tested without regressions on my two usual configs and (sigh);yes;no;no 354;MDY6Q29tbWl0MjMyNTI5ODoyYjQ1YWIzMzk4YTBiYTExOWIxZjY3MmM3YzU2ZmQ1YTQzMWI3ZjBh;David Rientjes;Linus Torvalds;"oom: fix constraint deadlock Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b45ab3398a0ba119b1f672c7c56fd5a431b7f0a;oom: fix constraint deadlock;yes;yes;no 354;MDY6Q29tbWl0MjMyNTI5ODoyYjQ1YWIzMzk4YTBiYTExOWIxZjY3MmM3YzU2ZmQ1YTQzMWI3ZjBh;David Rientjes;Linus Torvalds;"oom: fix constraint deadlock Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b45ab3398a0ba119b1f672c7c56fd5a431b7f0a;"Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL";yes;yes;yes 354;MDY6Q29tbWl0MjMyNTI5ODoyYjQ1YWIzMzk4YTBiYTExOWIxZjY3MmM3YzU2ZmQ1YTQzMWI3ZjBh;David Rientjes;Linus Torvalds;"oom: fix constraint deadlock Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b45ab3398a0ba119b1f672c7c56fd5a431b7f0a;"Before the OOM killer checks for the allocation constraint, it takes callback_mutex";no;no;yes 354;MDY6Q29tbWl0MjMyNTI5ODoyYjQ1YWIzMzk4YTBiYTExOWIxZjY3MmM3YzU2ZmQ1YTQzMWI3ZjBh;David Rientjes;Linus Torvalds;"oom: fix constraint deadlock Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b45ab3398a0ba119b1f672c7c56fd5a431b7f0a;"constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible";no;no;yes 354;MDY6Q29tbWl0MjMyNTI5ODoyYjQ1YWIzMzk4YTBiYTExOWIxZjY3MmM3YzU2ZmQ1YTQzMWI3ZjBh;David Rientjes;Linus Torvalds;"oom: fix constraint deadlock Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b45ab3398a0ba119b1f672c7c56fd5a431b7f0a;" If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset";no;no;yes 354;MDY6Q29tbWl0MjMyNTI5ODoyYjQ1YWIzMzk4YTBiYTExOWIxZjY3MmM3YzU2ZmQ1YTQzMWI3ZjBh;David Rientjes;Linus Torvalds;"oom: fix constraint deadlock Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b45ab3398a0ba119b1f672c7c56fd5a431b7f0a; This results in deadlock;no;yes;yes 354;MDY6Q29tbWl0MjMyNTI5ODoyYjQ1YWIzMzk4YTBiYTExOWIxZjY3MmM3YzU2ZmQ1YTQzMWI3ZjBh;David Rientjes;Linus Torvalds;"oom: fix constraint deadlock Fixes a deadlock in the OOM killer for allocations that are not __GFP_HARDWALL. Before the OOM killer checks for the allocation constraint, it takes callback_mutex. constrained_alloc() iterates through each zone in the allocation zonelist and calls cpuset_zone_allowed_softwall() to determine whether an allocation for gfp_mask is possible. If a zone's node is not in the OOM-triggering task's mems_allowed, it is not exiting, and we did not fail on a __GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take callback_mutex to check the nearest exclusive ancestor of current's cpuset. This results in deadlock. We now take callback_mutex after iterating through the zonelist since we don't need it yet. Cc: Andi Kleen <ak@suse.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b45ab3398a0ba119b1f672c7c56fd5a431b7f0a;"We now take callback_mutex after iterating through the zonelist since we don't need it yet.";yes;yes;no 355;MDY6Q29tbWl0MjMyNTI5ODoyYjc0NGMwMWE1NGZlMGM5OTc0ZmYxYjI5NTIyZjI1ZjA3MDg0MDUz;Yasunori Goto;Linus Torvalds;"mm: fix handling of panic_on_oom when cpusets are in use The current panic_on_oom may not work if there is a process using cpusets/mempolicy, because other nodes' memory may remain. But some people want failover by panic ASAP even if they are used. This patch makes new setting for its request. This is tested on my ia64 box which has 3 nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ethan Solomita <solo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b744c01a54fe0c9974ff1b29522f25f07084053;mm: fix handling of panic_on_oom when cpusets are in use;yes;yes;no 355;MDY6Q29tbWl0MjMyNTI5ODoyYjc0NGMwMWE1NGZlMGM5OTc0ZmYxYjI5NTIyZjI1ZjA3MDg0MDUz;Yasunori Goto;Linus Torvalds;"mm: fix handling of panic_on_oom when cpusets are in use The current panic_on_oom may not work if there is a process using cpusets/mempolicy, because other nodes' memory may remain. But some people want failover by panic ASAP even if they are used. This patch makes new setting for its request. This is tested on my ia64 box which has 3 nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ethan Solomita <solo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b744c01a54fe0c9974ff1b29522f25f07084053;"The current panic_on_oom may not work if there is a process using cpusets/mempolicy, because other nodes' memory may remain";no;yes;yes 355;MDY6Q29tbWl0MjMyNTI5ODoyYjc0NGMwMWE1NGZlMGM5OTc0ZmYxYjI5NTIyZjI1ZjA3MDg0MDUz;Yasunori Goto;Linus Torvalds;"mm: fix handling of panic_on_oom when cpusets are in use The current panic_on_oom may not work if there is a process using cpusets/mempolicy, because other nodes' memory may remain. But some people want failover by panic ASAP even if they are used. This patch makes new setting for its request. This is tested on my ia64 box which has 3 nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ethan Solomita <solo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b744c01a54fe0c9974ff1b29522f25f07084053;" But some people want failover by panic ASAP even if they are used";yes;yes;yes 355;MDY6Q29tbWl0MjMyNTI5ODoyYjc0NGMwMWE1NGZlMGM5OTc0ZmYxYjI5NTIyZjI1ZjA3MDg0MDUz;Yasunori Goto;Linus Torvalds;"mm: fix handling of panic_on_oom when cpusets are in use The current panic_on_oom may not work if there is a process using cpusets/mempolicy, because other nodes' memory may remain. But some people want failover by panic ASAP even if they are used. This patch makes new setting for its request. This is tested on my ia64 box which has 3 nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ethan Solomita <solo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b744c01a54fe0c9974ff1b29522f25f07084053;" This patch makes new setting for its request";yes;no;no 355;MDY6Q29tbWl0MjMyNTI5ODoyYjc0NGMwMWE1NGZlMGM5OTc0ZmYxYjI5NTIyZjI1ZjA3MDg0MDUz;Yasunori Goto;Linus Torvalds;"mm: fix handling of panic_on_oom when cpusets are in use The current panic_on_oom may not work if there is a process using cpusets/mempolicy, because other nodes' memory may remain. But some people want failover by panic ASAP even if they are used. This patch makes new setting for its request. This is tested on my ia64 box which has 3 nodes. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ethan Solomita <solo@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/2b744c01a54fe0c9974ff1b29522f25f07084053;This is tested on my ia64 box which has 3 nodes.;yes;no;no 356;MDY6Q29tbWl0MjMyNTI5ODo5YTgyNzgyZjhmNTgyMTlkMGM2ZGM1ZjAyMTFjZTMwMWFkZjZjNmY0;Joshua N Pritikin;Linus Torvalds;"allow oom_adj of saintly processes If the badness of a process is zero then oom_adj>0 has no effect. This patch makes sure that the oom_adj shift actually increases badness points appropriately. Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com> Cc: Andrea Arcangeli <andrea@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9a82782f8f58219d0c6dc5f0211ce301adf6c6f4;allow oom_adj of saintly processes;yes;no;no 356;MDY6Q29tbWl0MjMyNTI5ODo5YTgyNzgyZjhmNTgyMTlkMGM2ZGM1ZjAyMTFjZTMwMWFkZjZjNmY0;Joshua N Pritikin;Linus Torvalds;"allow oom_adj of saintly processes If the badness of a process is zero then oom_adj>0 has no effect. This patch makes sure that the oom_adj shift actually increases badness points appropriately. Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com> Cc: Andrea Arcangeli <andrea@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9a82782f8f58219d0c6dc5f0211ce301adf6c6f4;If the badness of a process is zero then oom_adj>0 has no effect;yes;no;yes 356;MDY6Q29tbWl0MjMyNTI5ODo5YTgyNzgyZjhmNTgyMTlkMGM2ZGM1ZjAyMTFjZTMwMWFkZjZjNmY0;Joshua N Pritikin;Linus Torvalds;"allow oom_adj of saintly processes If the badness of a process is zero then oom_adj>0 has no effect. This patch makes sure that the oom_adj shift actually increases badness points appropriately. Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com> Cc: Andrea Arcangeli <andrea@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/9a82782f8f58219d0c6dc5f0211ce301adf6c6f4;" This patch makes sure that the oom_adj shift actually increases badness points appropriately.";yes;yes;no 357;MDY6Q29tbWl0MjMyNTI5ODozZDEyNGNiYmEzMTY3MzdhZjhmM2E2OTU5ZWRiOTViYmQxMzBhNGQ4;Hugh Dickins;Linus Torvalds;"fix OOM killing processes wrongly thought MPOL_BIND I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog to see lots of other processes killed with ""No available memory (MPOL_BIND)"". memhog is killed correctly once we initialize nodemask in constrained_alloc(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3d124cbba316737af8f3a6959edb95bbd130a4d8;fix OOM killing processes wrongly thought MPOL_BIND;yes;yes;no 357;MDY6Q29tbWl0MjMyNTI5ODozZDEyNGNiYmEzMTY3MzdhZjhmM2E2OTU5ZWRiOTViYmQxMzBhNGQ4;Hugh Dickins;Linus Torvalds;"fix OOM killing processes wrongly thought MPOL_BIND I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog to see lots of other processes killed with ""No available memory (MPOL_BIND)"". memhog is killed correctly once we initialize nodemask in constrained_alloc(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3d124cbba316737af8f3a6959edb95bbd130a4d8;"I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog to see lots of other processes killed with ""No available memory (MPOL_BIND)""";no;yes;yes 357;MDY6Q29tbWl0MjMyNTI5ODozZDEyNGNiYmEzMTY3MzdhZjhmM2E2OTU5ZWRiOTViYmQxMzBhNGQ4;Hugh Dickins;Linus Torvalds;"fix OOM killing processes wrongly thought MPOL_BIND I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog to see lots of other processes killed with ""No available memory (MPOL_BIND)"". memhog is killed correctly once we initialize nodemask in constrained_alloc(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/3d124cbba316737af8f3a6959edb95bbd130a4d8;" memhog is killed correctly once we initialize nodemask in constrained_alloc().";yes;no;yes 358;MDY6Q29tbWl0MjMyNTI5ODo2NTBhN2M5NzRmMWI5MWRlOTczMmMwZjcyMGU3OTI4MzdmOGFiZmQ2;David Rientjes;Linus Torvalds;"oom: kill all threads that share mm with killed task oom_kill_task() calls __oom_kill_task() to OOM kill a selected task. When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one. (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584) Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/650a7c974f1b91de9732c0f720e792837f8abfd6;oom: kill all threads that share mm with killed task;yes;no;no 358;MDY6Q29tbWl0MjMyNTI5ODo2NTBhN2M5NzRmMWI5MWRlOTczMmMwZjcyMGU3OTI4MzdmOGFiZmQ2;David Rientjes;Linus Torvalds;"oom: kill all threads that share mm with killed task oom_kill_task() calls __oom_kill_task() to OOM kill a selected task. When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one. (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584) Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/650a7c974f1b91de9732c0f720e792837f8abfd6;oom_kill_task() calls __oom_kill_task() to OOM kill a selected task;no;no;yes 358;MDY6Q29tbWl0MjMyNTI5ODo2NTBhN2M5NzRmMWI5MWRlOTczMmMwZjcyMGU3OTI4MzdmOGFiZmQ2;David Rientjes;Linus Torvalds;"oom: kill all threads that share mm with killed task oom_kill_task() calls __oom_kill_task() to OOM kill a selected task. When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one. (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584) Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/650a7c974f1b91de9732c0f720e792837f8abfd6;"When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one";yes;yes;no 358;MDY6Q29tbWl0MjMyNTI5ODo2NTBhN2M5NzRmMWI5MWRlOTczMmMwZjcyMGU3OTI4MzdmOGFiZmQ2;David Rientjes;Linus Torvalds;"oom: kill all threads that share mm with killed task oom_kill_task() calls __oom_kill_task() to OOM kill a selected task. When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one. (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584) Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/650a7c974f1b91de9732c0f720e792837f8abfd6;(Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584);no;yes;yes 359;MDY6Q29tbWl0MjMyNTI5ODozNWFlODM0ZmEwMmJhODljZmJkNGE4MDg5MmMwZTQ1OGZkNmQ1YzBi;Ankita Garg;Linus Torvalds;"[PATCH] oom fix: prevent oom from killing a process with children/sibling unkillable Looking at oom_kill.c, found that the intention to not kill the selected process if any of its children/siblings has OOM_DISABLE set, is not being met. Signed-off-by: Ankita Garg <ankita@in.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/35ae834fa02ba89cfbd4a80892c0e458fd6d5c0b;[PATCH] oom fix: prevent oom from killing a process with children/sibling unkillable;yes;yes;no 359;MDY6Q29tbWl0MjMyNTI5ODozNWFlODM0ZmEwMmJhODljZmJkNGE4MDg5MmMwZTQ1OGZkNmQ1YzBi;Ankita Garg;Linus Torvalds;"[PATCH] oom fix: prevent oom from killing a process with children/sibling unkillable Looking at oom_kill.c, found that the intention to not kill the selected process if any of its children/siblings has OOM_DISABLE set, is not being met. Signed-off-by: Ankita Garg <ankita@in.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>";https://api.github.com/repos/torvalds/linux/git/commits/35ae834fa02ba89cfbd4a80892c0e458fd6d5c0b;"Looking at oom_kill.c, found that the intention to not kill the selected process if any of its children/siblings has OOM_DISABLE set, is not being met.";no;yes;yes 360;MDY6Q29tbWl0MjMyNTI5ODo3YmEzNDg1OTQ3ZWU3YmM4OWExN2Y4NjI1MGZlOWI2OTJhNjE1ZGZm;Hugh Dickins;Linus Torvalds;"[PATCH] fix OOM killing of swapoff These days, if you swapoff when there isn't enough memory, OOM killer gives ""BUG: scheduling while atomic"" and the machine hangs: badness() needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any moment, but that doesn't really matter). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ba3485947ee7bc89a17f86250fe9b692a615dff;[PATCH] fix OOM killing of swapoff;yes;yes;no 360;MDY6Q29tbWl0MjMyNTI5ODo3YmEzNDg1OTQ3ZWU3YmM4OWExN2Y4NjI1MGZlOWI2OTJhNjE1ZGZm;Hugh Dickins;Linus Torvalds;"[PATCH] fix OOM killing of swapoff These days, if you swapoff when there isn't enough memory, OOM killer gives ""BUG: scheduling while atomic"" and the machine hangs: badness() needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any moment, but that doesn't really matter). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/7ba3485947ee7bc89a17f86250fe9b692a615dff;"These days, if you swapoff when there isn't enough memory, OOM killer gives ""BUG: scheduling while atomic"" and the machine hangs: badness() needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any moment, but that doesn't really matter).";yes;yes;yes 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;[PATCH] fix oom killer kills current every time if there is memory-less-node take2;yes;yes;no 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;"constrained_alloc(), which is called to detect where oom is from, checks passed zone_list()";no;no;yes 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;" If zone_list doesn't include all nodes, it thinks oom is from mempolicy";no;no;yes 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;But there is memory-less-node;no;no;yes 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;" memory-less-node's zones are never included in zonelist[]";no;no;yes 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;contstrained_alloc() should get memory_less_node into count;yes;yes;no 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;" Otherwise, it always thinks 'oom is from mempolicy'";no;yes;yes 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a;" This means that current process dies at any time";no;yes;yes 361;MDY6Q29tbWl0MjMyNTI5ODo5NmFjNTkxM2Y0ZTQ1YzZhMWI5ODM1MGYyYzBhOGJiM2FiZTI2NDZh;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] fix oom killer kills current every time if there is memory-less-node take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/96ac5913f4e45c6a1b98350f2c0a8bb3abe2646a; This patch fix it.;yes;yes;no 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;[PATCH] cpuset: rework cpuset_zone_allowed api;yes;no;no 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants";yes;yes;no 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;" cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument";no;no;yes 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version";no;no;yes 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"Unfortunately, this meant that users would end up with the softwall version without thinking about it";no;yes;yes 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;" Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion";no;yes;yes 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case";yes;yes;yes 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine";yes;no;no 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines";yes;no;no 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;" It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag";yes;yes;yes 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now";yes;no;yes 362;MDY6Q29tbWl0MjMyNTI5ODowMmEwZTUzZDgyMjdhZmY1ZTYyZTA0MzNmODJjMTJjMWMyODA1ZmQ2;Paul Jackson;Linus Torvalds;"[PATCH] cpuset: rework cpuset_zone_allowed api Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/02a0e53d8227aff5e62e0433f82c12c1c2805fd6;"Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again.";yes;yes;no 363;MDY6Q29tbWl0MjMyNTI5ODpmMmEyYTcxMDhhYTAwMzliYTdhNWZlN2EwZDJlY2VmMjIxOWE3NTg0;Nick Piggin;Linus Torvalds;"[PATCH] oom: less memdie Don't cause all threads in all other thread groups to gain TIF_MEMDIE otherwise we'll get a thundering herd eating our memory reserve. This may not be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE in the system at once. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584;[PATCH] oom: less memdie;yes;no;no 363;MDY6Q29tbWl0MjMyNTI5ODpmMmEyYTcxMDhhYTAwMzliYTdhNWZlN2EwZDJlY2VmMjIxOWE3NTg0;Nick Piggin;Linus Torvalds;"[PATCH] oom: less memdie Don't cause all threads in all other thread groups to gain TIF_MEMDIE otherwise we'll get a thundering herd eating our memory reserve. This may not be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE in the system at once. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584;"Don't cause all threads in all other thread groups to gain TIF_MEMDIE otherwise we'll get a thundering herd eating our memory reserve";yes;yes;no 363;MDY6Q29tbWl0MjMyNTI5ODpmMmEyYTcxMDhhYTAwMzliYTdhNWZlN2EwZDJlY2VmMjIxOWE3NTg0;Nick Piggin;Linus Torvalds;"[PATCH] oom: less memdie Don't cause all threads in all other thread groups to gain TIF_MEMDIE otherwise we'll get a thundering herd eating our memory reserve. This may not be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE in the system at once. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584;" This may not be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE in the system at once.";yes;yes;yes 364;MDY6Q29tbWl0MjMyNTI5ODpmM2FmMzhkMzBjMTg1MzhkMDY5YTk1ZTYyNGEzZGI3YzNkNDg2YTFl;Nick Piggin;Linus Torvalds;"[PATCH] oom: cleanup messages Clean up the OOM killer messages to be more consistent. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f3af38d30c18538d069a95e624a3db7c3d486a1e;[PATCH] oom: cleanup messages;yes;yes;no 364;MDY6Q29tbWl0MjMyNTI5ODpmM2FmMzhkMzBjMTg1MzhkMDY5YTk1ZTYyNGEzZGI3YzNkNDg2YTFl;Nick Piggin;Linus Torvalds;"[PATCH] oom: cleanup messages Clean up the OOM killer messages to be more consistent. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f3af38d30c18538d069a95e624a3db7c3d486a1e;Clean up the OOM killer messages to be more consistent.;yes;yes;no 365;MDY6Q29tbWl0MjMyNTI5ODpjMzNlMGZjYTM1MDhmMGFhMzg3YjFjMTBkMGVmMTU4MTAyZGViMTQw;Nick Piggin;Linus Torvalds;"[PATCH] oom: don't kill unkillable children or siblings Abort the kill if any of our threads have OOM_DISABLE set. Having this test here also prevents any OOM_DISABLE child of the ""selected"" process from being killed. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/c33e0fca3508f0aa387b1c10d0ef158102deb140;[PATCH] oom: don't kill unkillable children or siblings;yes;yes;no 365;MDY6Q29tbWl0MjMyNTI5ODpjMzNlMGZjYTM1MDhmMGFhMzg3YjFjMTBkMGVmMTU4MTAyZGViMTQw;Nick Piggin;Linus Torvalds;"[PATCH] oom: don't kill unkillable children or siblings Abort the kill if any of our threads have OOM_DISABLE set. Having this test here also prevents any OOM_DISABLE child of the ""selected"" process from being killed. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/c33e0fca3508f0aa387b1c10d0ef158102deb140;Abort the kill if any of our threads have OOM_DISABLE set;yes;no;no 365;MDY6Q29tbWl0MjMyNTI5ODpjMzNlMGZjYTM1MDhmMGFhMzg3YjFjMTBkMGVmMTU4MTAyZGViMTQw;Nick Piggin;Linus Torvalds;"[PATCH] oom: don't kill unkillable children or siblings Abort the kill if any of our threads have OOM_DISABLE set. Having this test here also prevents any OOM_DISABLE child of the ""selected"" process from being killed. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/c33e0fca3508f0aa387b1c10d0ef158102deb140;" Having this test here also prevents any OOM_DISABLE child of the ""selected"" process from being killed.";yes;yes;no 366;MDY6Q29tbWl0MjMyNTI5ODo4YWM3NzNiNGY3M2FmYTZmZDY2Njk1MTMxMTAzOTQ0Yjk3NWQ1ZDVj;Alexey Dobriyan;Linus Torvalds;"[PATCH] OOM killer meets userspace headers Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac773b4f73afa6fd66695131103944b975d5d5c;[PATCH] OOM killer meets userspace headers;yes;no;no 366;MDY6Q29tbWl0MjMyNTI5ODo4YWM3NzNiNGY3M2FmYTZmZDY2Njk1MTMxMTAzOTQ0Yjk3NWQ1ZDVj;Alexey Dobriyan;Linus Torvalds;"[PATCH] OOM killer meets userspace headers Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac773b4f73afa6fd66695131103944b975d5d5c;"Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process";no;yes;yes 366;MDY6Q29tbWl0MjMyNTI5ODo4YWM3NzNiNGY3M2FmYTZmZDY2Njk1MTMxMTAzOTQ0Yjk3NWQ1ZDVj;Alexey Dobriyan;Linus Torvalds;"[PATCH] OOM killer meets userspace headers Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac773b4f73afa6fd66695131103944b975d5d5c;"So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there";yes;no;no 366;MDY6Q29tbWl0MjMyNTI5ODo4YWM3NzNiNGY3M2FmYTZmZDY2Njk1MTMxMTAzOTQ0Yjk3NWQ1ZDVj;Alexey Dobriyan;Linus Torvalds;"[PATCH] OOM killer meets userspace headers Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac773b4f73afa6fd66695131103944b975d5d5c;"c) turn bounding values of /proc/$PID/oom_adj into defines and export them too";yes;no;no 366;MDY6Q29tbWl0MjMyNTI5ODo4YWM3NzNiNGY3M2FmYTZmZDY2Njk1MTMxMTAzOTQ0Yjk3NWQ1ZDVj;Alexey Dobriyan;Linus Torvalds;"[PATCH] OOM killer meets userspace headers Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8ac773b4f73afa6fd66695131103944b975d5d5c;Note: mass __KERNEL__ removal will be done later.;yes;no;yes 367;MDY6Q29tbWl0MjMyNTI5ODpiNzg0ODNhNGJhNjBkNWQ5MDkzMDI2MmE1MzNhNzg0ZTFkOWRmNjYw;Nick Piggin;Linus Torvalds;"[PATCH] oom: don't kill current when another OOM in progress A previous patch to allow an exiting task to OOM kill itself (and thereby avoid a little deadlock) introduced a problem. We don't want the PF_EXITING task, even if it is 'current', to access mem reserves if there is already a TIF_MEMDIE process in the system sucking up reserves. Also make the commenting a little bit clearer, and note that our current scheme of effectively single threading the OOM killer is not itself perfect. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b78483a4ba60d5d90930262a533a784e1d9df660;[PATCH] oom: don't kill current when another OOM in progress;yes;no;no 367;MDY6Q29tbWl0MjMyNTI5ODpiNzg0ODNhNGJhNjBkNWQ5MDkzMDI2MmE1MzNhNzg0ZTFkOWRmNjYw;Nick Piggin;Linus Torvalds;"[PATCH] oom: don't kill current when another OOM in progress A previous patch to allow an exiting task to OOM kill itself (and thereby avoid a little deadlock) introduced a problem. We don't want the PF_EXITING task, even if it is 'current', to access mem reserves if there is already a TIF_MEMDIE process in the system sucking up reserves. Also make the commenting a little bit clearer, and note that our current scheme of effectively single threading the OOM killer is not itself perfect. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b78483a4ba60d5d90930262a533a784e1d9df660;"A previous patch to allow an exiting task to OOM kill itself (and thereby avoid a little deadlock) introduced a problem";no;yes;yes 367;MDY6Q29tbWl0MjMyNTI5ODpiNzg0ODNhNGJhNjBkNWQ5MDkzMDI2MmE1MzNhNzg0ZTFkOWRmNjYw;Nick Piggin;Linus Torvalds;"[PATCH] oom: don't kill current when another OOM in progress A previous patch to allow an exiting task to OOM kill itself (and thereby avoid a little deadlock) introduced a problem. We don't want the PF_EXITING task, even if it is 'current', to access mem reserves if there is already a TIF_MEMDIE process in the system sucking up reserves. Also make the commenting a little bit clearer, and note that our current scheme of effectively single threading the OOM killer is not itself perfect. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b78483a4ba60d5d90930262a533a784e1d9df660;" We don't want the PF_EXITING task, even if it is 'current', to access mem reserves if there is already a TIF_MEMDIE process in the system sucking up reserves";yes;yes;no 367;MDY6Q29tbWl0MjMyNTI5ODpiNzg0ODNhNGJhNjBkNWQ5MDkzMDI2MmE1MzNhNzg0ZTFkOWRmNjYw;Nick Piggin;Linus Torvalds;"[PATCH] oom: don't kill current when another OOM in progress A previous patch to allow an exiting task to OOM kill itself (and thereby avoid a little deadlock) introduced a problem. We don't want the PF_EXITING task, even if it is 'current', to access mem reserves if there is already a TIF_MEMDIE process in the system sucking up reserves. Also make the commenting a little bit clearer, and note that our current scheme of effectively single threading the OOM killer is not itself perfect. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b78483a4ba60d5d90930262a533a784e1d9df660;"Also make the commenting a little bit clearer, and note that our current scheme of effectively single threading the OOM killer is not itself perfect.";yes;yes;no 368;MDY6Q29tbWl0MjMyNTI5ODowMTAxN2EyMjcwNDRkNjRmYWNlMjU4OGZhYjk0MjdhMWRhMWJkYjlm;Oleg Nesterov;Linus Torvalds;"[PATCH] oom_kill_task(): cleanup ->mm checks - It is not possible to have task->mm == &init_mm. - task_lock() buys nothing for 'if (!p->mm)' check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/01017a227044d64face2588fab9427a1da1bdb9f;[PATCH] oom_kill_task(): cleanup ->mm checks;yes;yes;no 368;MDY6Q29tbWl0MjMyNTI5ODowMTAxN2EyMjcwNDRkNjRmYWNlMjU4OGZhYjk0MjdhMWRhMWJkYjlm;Oleg Nesterov;Linus Torvalds;"[PATCH] oom_kill_task(): cleanup ->mm checks - It is not possible to have task->mm == &init_mm. - task_lock() buys nothing for 'if (!p->mm)' check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/01017a227044d64face2588fab9427a1da1bdb9f;- It is not possible to have task->mm == &init_mm;no;yes;yes 368;MDY6Q29tbWl0MjMyNTI5ODowMTAxN2EyMjcwNDRkNjRmYWNlMjU4OGZhYjk0MjdhMWRhMWJkYjlm;Oleg Nesterov;Linus Torvalds;"[PATCH] oom_kill_task(): cleanup ->mm checks - It is not possible to have task->mm == &init_mm. - task_lock() buys nothing for 'if (!p->mm)' check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/01017a227044d64face2588fab9427a1da1bdb9f;- task_lock() buys nothing for 'if (!p->mm)' check.;no;yes;yes 369;MDY6Q29tbWl0MjMyNTI5ODo5NzJjNGVhNTljOWRiZjgyNjQ3ZWU5NjY1ZDllOTQ1MjQxOTExYTUx;Oleg Nesterov;Linus Torvalds;"[PATCH] select_bad_process(): cleanup 'releasing' check No logic changes, but imho easier to read. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/972c4ea59c9dbf82647ee9665d9e945241911a51;[PATCH] select_bad_process(): cleanup 'releasing' check;yes;yes;no 369;MDY6Q29tbWl0MjMyNTI5ODo5NzJjNGVhNTljOWRiZjgyNjQ3ZWU5NjY1ZDllOTQ1MjQxOTExYTUx;Oleg Nesterov;Linus Torvalds;"[PATCH] select_bad_process(): cleanup 'releasing' check No logic changes, but imho easier to read. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/972c4ea59c9dbf82647ee9665d9e945241911a51;No logic changes, but imho easier to read.;yes;yes;no 370;MDY6Q29tbWl0MjMyNTI5ODoyODMyNGQxZGY2NDY1MjEyNTZlODMzODkyNDRhZGNjZTk4ZTg5ZmYy;Oleg Nesterov;Linus Torvalds;"[PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD check The only one usage of TASK_DEAD outside of last schedule path, select_bad_process: for_each_task(p) { if (!p->mm) continue; ... if (p->state == TASK_DEAD) continue; ... TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above. Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL. Also, remove open-coded is_init(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/28324d1df646521256e83389244adcce98e89ff2;[PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD check;yes;yes;no 370;MDY6Q29tbWl0MjMyNTI5ODoyODMyNGQxZGY2NDY1MjEyNTZlODMzODkyNDRhZGNjZTk4ZTg5ZmYy;Oleg Nesterov;Linus Torvalds;"[PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD check The only one usage of TASK_DEAD outside of last schedule path, select_bad_process: for_each_task(p) { if (!p->mm) continue; ... if (p->state == TASK_DEAD) continue; ... TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above. Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL. Also, remove open-coded is_init(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/28324d1df646521256e83389244adcce98e89ff2;"The only one usage of TASK_DEAD outside of last schedule path, select_bad_process";no;no;yes 370;MDY6Q29tbWl0MjMyNTI5ODoyODMyNGQxZGY2NDY1MjEyNTZlODMzODkyNDRhZGNjZTk4ZTg5ZmYy;Oleg Nesterov;Linus Torvalds;"[PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD check The only one usage of TASK_DEAD outside of last schedule path, select_bad_process: for_each_task(p) { if (!p->mm) continue; ... if (p->state == TASK_DEAD) continue; ... TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above. Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL. Also, remove open-coded is_init(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/28324d1df646521256e83389244adcce98e89ff2;"TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above";no;yes;yes 370;MDY6Q29tbWl0MjMyNTI5ODoyODMyNGQxZGY2NDY1MjEyNTZlODMzODkyNDRhZGNjZTk4ZTg5ZmYy;Oleg Nesterov;Linus Torvalds;"[PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD check The only one usage of TASK_DEAD outside of last schedule path, select_bad_process: for_each_task(p) { if (!p->mm) continue; ... if (p->state == TASK_DEAD) continue; ... TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above. Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL. Also, remove open-coded is_init(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/28324d1df646521256e83389244adcce98e89ff2;"Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL";no;yes;yes 370;MDY6Q29tbWl0MjMyNTI5ODoyODMyNGQxZGY2NDY1MjEyNTZlODMzODkyNDRhZGNjZTk4ZTg5ZmYy;Oleg Nesterov;Linus Torvalds;"[PATCH] select_bad_process(): kill a bogus PF_DEAD/TASK_DEAD check The only one usage of TASK_DEAD outside of last schedule path, select_bad_process: for_each_task(p) { if (!p->mm) continue; ... if (p->state == TASK_DEAD) continue; ... TASK_DEAD state is set at the end of do_exit(), this means that p->mm was already set == NULL by exit_mm(), so this task was already rejected by 'if (!p->mm)' above. Note also that the caller holds tasklist_lock, this means that p can't pass exit_notify() and then set TASK_DEAD when p->mm != NULL. Also, remove open-coded is_init(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/28324d1df646521256e83389244adcce98e89ff2;Also, remove open-coded is_init().;yes;no;no 371;MDY6Q29tbWl0MjMyNTI5ODpjMzk0Y2M5ZmJiMzY3Zjg3ZmFhMjIyOGVjMmVhYmFjZDJkNDcwMWM2;Oleg Nesterov;Linus Torvalds;"[PATCH] introduce TASK_DEAD state I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/c394cc9fbb367f87faa2228ec2eabacd2d4701c6;[PATCH] introduce TASK_DEAD state;yes;no;no 371;MDY6Q29tbWl0MjMyNTI5ODpjMzk0Y2M5ZmJiMzY3Zjg3ZmFhMjIyOGVjMmVhYmFjZDJkNDcwMWM2;Oleg Nesterov;Linus Torvalds;"[PATCH] introduce TASK_DEAD state I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/c394cc9fbb367f87faa2228ec2eabacd2d4701c6;I am not sure about this patch, I am asking Ingo to take a decision;yes;no;no 371;MDY6Q29tbWl0MjMyNTI5ODpjMzk0Y2M5ZmJiMzY3Zjg3ZmFhMjIyOGVjMmVhYmFjZDJkNDcwMWM2;Oleg Nesterov;Linus Torvalds;"[PATCH] introduce TASK_DEAD state I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/c394cc9fbb367f87faa2228ec2eabacd2d4701c6;"task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h";yes;yes;no 371;MDY6Q29tbWl0MjMyNTI5ODpjMzk0Y2M5ZmJiMzY3Zjg3ZmFhMjIyOGVjMmVhYmFjZDJkNDcwMWM2;Oleg Nesterov;Linus Torvalds;"[PATCH] introduce TASK_DEAD state I am not sure about this patch, I am asking Ingo to take a decision. task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should live only in ->exit_state as documented in sched.h. Note that this state is not visible to user-space, get_task_state() masks off unsuitable states. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/c394cc9fbb367f87faa2228ec2eabacd2d4701c6;"Note that this state is not visible to user-space, get_task_state() masks off unsuitable states.";yes;yes;yes 372;MDY6Q29tbWl0MjMyNTI5ODo1NWExMDFmOGY3MWEzZDNkYmRhN2I1Yzc3MDgzZmZlNDc1NTJmODMx;Oleg Nesterov;Linus Torvalds;"[PATCH] kill PF_DEAD flag After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we don't need PF_DEAD any longer. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/55a101f8f71a3d3dbda7b5c77083ffe47552f831;[PATCH] kill PF_DEAD flag;yes;no;no 372;MDY6Q29tbWl0MjMyNTI5ODo1NWExMDFmOGY3MWEzZDNkYmRhN2I1Yzc3MDgzZmZlNDc1NTJmODMx;Oleg Nesterov;Linus Torvalds;"[PATCH] kill PF_DEAD flag After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we don't need PF_DEAD any longer. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/55a101f8f71a3d3dbda7b5c77083ffe47552f831;"After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we don't need PF_DEAD any longer.";no;yes;yes 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;[PATCH] pidspace: is_init();yes;no;no 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;This is an updated version of Eric Biederman's is_init() patch;yes;yes;no 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;"( It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init()";yes;no;no 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;"Further, is_init() checks pid and thus removes dependency on Eric's other patches for now";yes;yes;no 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;Eric's original description;no;no;yes 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;" There are a lot of places in the kernel where we test for init because we give it special properties";no;no;yes 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;" Most significantly init must not die";no;no;yes 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;" This results in code all over the kernel test ->pid == 1";no;yes;yes 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac; Introduce is_init to capture this case;yes;yes;yes 373;MDY6Q29tbWl0MjMyNTI5ODpmNDAwZTE5OGIyZWQyNmNlNTViMjJhMTQxMmRlZDA4OTZlNzUxNmFj;Sukadev Bhattiprolu;Linus Torvalds;"[PATCH] pidspace: is_init() This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/f400e198b2ed26ce55b22a1412ded0896e7516ac;" With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1.";yes;no;yes 374;MDY6Q29tbWl0MjMyNTI5ODo4OWZhMzAyNDJmYWNjYTI0OWFlYWQyYWFjMDNjNGM2OTc2NGY5MTFj;Christoph Lameter;Linus Torvalds;"[PATCH] NUMA: Add zone_to_nid function There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/89fa30242facca249aead2aac03c4c69764f911c;[PATCH] NUMA: Add zone_to_nid function;yes;no;no 374;MDY6Q29tbWl0MjMyNTI5ODo4OWZhMzAyNDJmYWNjYTI0OWFlYWQyYWFjMDNjNGM2OTc2NGY5MTFj;Christoph Lameter;Linus Torvalds;"[PATCH] NUMA: Add zone_to_nid function There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/89fa30242facca249aead2aac03c4c69764f911c;There are many places where we need to determine the node of a zone;no;yes;yes 374;MDY6Q29tbWl0MjMyNTI5ODo4OWZhMzAyNDJmYWNjYTI0OWFlYWQyYWFjMDNjNGM2OTc2NGY5MTFj;Christoph Lameter;Linus Torvalds;"[PATCH] NUMA: Add zone_to_nid function There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/89fa30242facca249aead2aac03c4c69764f911c;Currently we use a difficult to read sequence of pointer dereferencing;no;yes;yes 374;MDY6Q29tbWl0MjMyNTI5ODo4OWZhMzAyNDJmYWNjYTI0OWFlYWQyYWFjMDNjNGM2OTc2NGY5MTFj;Christoph Lameter;Linus Torvalds;"[PATCH] NUMA: Add zone_to_nid function There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/89fa30242facca249aead2aac03c4c69764f911c;Put that into an inline function and use throughout VM;yes;no;no 374;MDY6Q29tbWl0MjMyNTI5ODo4OWZhMzAyNDJmYWNjYTI0OWFlYWQyYWFjMDNjNGM2OTc2NGY5MTFj;Christoph Lameter;Linus Torvalds;"[PATCH] NUMA: Add zone_to_nid function There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/89fa30242facca249aead2aac03c4c69764f911c;" Maybe we can find a way to optimize the lookup in the future.";yes;yes;no 375;MDY6Q29tbWl0MjMyNTI5ODo1YTI5MWI5OGIyMTE2ZDY2OTQ0OTg4NWFiZWYzMDAwZjc0NzUwNGIz;Ram Gupta;Linus Torvalds;"[PATCH] oom-kill: update comments to reflect current code Update the comments for __oom_kill_task() to reflect the code changes. Signed-off-by: Ram Gupta <r.gupta@astronautics.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a291b98b2116d669449885abef3000f747504b3;[PATCH] oom-kill: update comments to reflect current code;yes;yes;no 375;MDY6Q29tbWl0MjMyNTI5ODo1YTI5MWI5OGIyMTE2ZDY2OTQ0OTg4NWFiZWYzMDAwZjc0NzUwNGIz;Ram Gupta;Linus Torvalds;"[PATCH] oom-kill: update comments to reflect current code Update the comments for __oom_kill_task() to reflect the code changes. Signed-off-by: Ram Gupta <r.gupta@astronautics.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/5a291b98b2116d669449885abef3000f747504b3;Update the comments for __oom_kill_task() to reflect the code changes.;yes;yes;no 376;MDY6Q29tbWl0MjMyNTI5ODpiNzJmMTYwNDQzY2I3OGIyZjhhZGRhZTZlMzMxZDJhZGFhNzBmODY5;Nick Piggin;Linus Torvalds;"[PATCH] oom: more printk Print the name of the task invoking the OOM killer. Could make debugging easier. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b72f160443cb78b2f8addae6e331d2adaa70f869;[PATCH] oom: more printk;yes;no;no 376;MDY6Q29tbWl0MjMyNTI5ODpiNzJmMTYwNDQzY2I3OGIyZjhhZGRhZTZlMzMxZDJhZGFhNzBmODY5;Nick Piggin;Linus Torvalds;"[PATCH] oom: more printk Print the name of the task invoking the OOM killer. Could make debugging easier. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b72f160443cb78b2f8addae6e331d2adaa70f869;Print the name of the task invoking the OOM killer;yes;yes;no 376;MDY6Q29tbWl0MjMyNTI5ODpiNzJmMTYwNDQzY2I3OGIyZjhhZGRhZTZlMzMxZDJhZGFhNzBmODY5;Nick Piggin;Linus Torvalds;"[PATCH] oom: more printk Print the name of the task invoking the OOM killer. Could make debugging easier. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b72f160443cb78b2f8addae6e331d2adaa70f869;" Could make debugging easier.";no;yes;no 377;MDY6Q29tbWl0MjMyNTI5ODo1MDgxZGRlMzNmN2E2MWQyOGQ5YjE4NWNjMzg2ZjEyY2I4MzdjN2E0;Nick Piggin;Linus Torvalds;"[PATCH] oom: kthread infinite loop fix Skip kernel threads, rather than having them return 0 from badness. Theoretically, badness might truncate all results to 0, thus a kernel thread might be picked first, causing an infinite loop. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/5081dde33f7a61d28d9b185cc386f12cb837c7a4;[PATCH] oom: kthread infinite loop fix;yes;yes;no 377;MDY6Q29tbWl0MjMyNTI5ODo1MDgxZGRlMzNmN2E2MWQyOGQ5YjE4NWNjMzg2ZjEyY2I4MzdjN2E0;Nick Piggin;Linus Torvalds;"[PATCH] oom: kthread infinite loop fix Skip kernel threads, rather than having them return 0 from badness. Theoretically, badness might truncate all results to 0, thus a kernel thread might be picked first, causing an infinite loop. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/5081dde33f7a61d28d9b185cc386f12cb837c7a4;Skip kernel threads, rather than having them return 0 from badness;yes;no;no 377;MDY6Q29tbWl0MjMyNTI5ODo1MDgxZGRlMzNmN2E2MWQyOGQ5YjE4NWNjMzg2ZjEyY2I4MzdjN2E0;Nick Piggin;Linus Torvalds;"[PATCH] oom: kthread infinite loop fix Skip kernel threads, rather than having them return 0 from badness. Theoretically, badness might truncate all results to 0, thus a kernel thread might be picked first, causing an infinite loop. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/5081dde33f7a61d28d9b185cc386f12cb837c7a4;"Theoretically, badness might truncate all results to 0, thus a kernel thread might be picked first, causing an infinite loop.";yes;yes;yes 378;MDY6Q29tbWl0MjMyNTI5ODphZjViOTEyNDM1ZGUzMmZiZWRlMDhjZWU5NDk0Mjk4MjNlZDQ5Nzgx;Nick Piggin;Linus Torvalds;"[PATCH] oom: swapoff tasks tweak PF_SWAPOFF processes currently cause select_bad_process to return straight away. Instead, give them high priority, so we will kill them first, however we also first ensure no parallel OOM kills are happening at the same time. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b912435de32fbede08cee949429823ed49781;[PATCH] oom: swapoff tasks tweak;yes;no;no 378;MDY6Q29tbWl0MjMyNTI5ODphZjViOTEyNDM1ZGUzMmZiZWRlMDhjZWU5NDk0Mjk4MjNlZDQ5Nzgx;Nick Piggin;Linus Torvalds;"[PATCH] oom: swapoff tasks tweak PF_SWAPOFF processes currently cause select_bad_process to return straight away. Instead, give them high priority, so we will kill them first, however we also first ensure no parallel OOM kills are happening at the same time. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b912435de32fbede08cee949429823ed49781;"PF_SWAPOFF processes currently cause select_bad_process to return straight away";no;no;yes 378;MDY6Q29tbWl0MjMyNTI5ODphZjViOTEyNDM1ZGUzMmZiZWRlMDhjZWU5NDk0Mjk4MjNlZDQ5Nzgx;Nick Piggin;Linus Torvalds;"[PATCH] oom: swapoff tasks tweak PF_SWAPOFF processes currently cause select_bad_process to return straight away. Instead, give them high priority, so we will kill them first, however we also first ensure no parallel OOM kills are happening at the same time. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/af5b912435de32fbede08cee949429823ed49781;" Instead, give them high priority, so we will kill them first, however we also first ensure no parallel OOM kills are happening at the same time.";yes;yes;no 379;MDY6Q29tbWl0MjMyNTI5ODo0YTNlZGUxMDdlNDIyYTBjNTNkMjgwMjRiMGFhOTAyY2EyMmE4NzY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle oom_disable exiting Having the oomkilladj == OOM_DISABLE check before the releasing check means that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer. Moving the test down will give the desired behaviour. Also: it will allow them to ""OOM-kill"" themselves if they are exiting. As per the previous patch, this is required to prevent OOM killer deadlocks (and they don't actually get killed, because they're already exiting -- they're simply allowed access to memory reserves). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/4a3ede107e422a0c53d28024b0aa902ca22a8768;[PATCH] oom: handle oom_disable exiting;yes;yes;no 379;MDY6Q29tbWl0MjMyNTI5ODo0YTNlZGUxMDdlNDIyYTBjNTNkMjgwMjRiMGFhOTAyY2EyMmE4NzY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle oom_disable exiting Having the oomkilladj == OOM_DISABLE check before the releasing check means that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer. Moving the test down will give the desired behaviour. Also: it will allow them to ""OOM-kill"" themselves if they are exiting. As per the previous patch, this is required to prevent OOM killer deadlocks (and they don't actually get killed, because they're already exiting -- they're simply allowed access to memory reserves). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/4a3ede107e422a0c53d28024b0aa902ca22a8768;"Having the oomkilladj == OOM_DISABLE check before the releasing check means that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer";no;no;yes 379;MDY6Q29tbWl0MjMyNTI5ODo0YTNlZGUxMDdlNDIyYTBjNTNkMjgwMjRiMGFhOTAyY2EyMmE4NzY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle oom_disable exiting Having the oomkilladj == OOM_DISABLE check before the releasing check means that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer. Moving the test down will give the desired behaviour. Also: it will allow them to ""OOM-kill"" themselves if they are exiting. As per the previous patch, this is required to prevent OOM killer deadlocks (and they don't actually get killed, because they're already exiting -- they're simply allowed access to memory reserves). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/4a3ede107e422a0c53d28024b0aa902ca22a8768;Moving the test down will give the desired behaviour;yes;yes;no 379;MDY6Q29tbWl0MjMyNTI5ODo0YTNlZGUxMDdlNDIyYTBjNTNkMjgwMjRiMGFhOTAyY2EyMmE4NzY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle oom_disable exiting Having the oomkilladj == OOM_DISABLE check before the releasing check means that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer. Moving the test down will give the desired behaviour. Also: it will allow them to ""OOM-kill"" themselves if they are exiting. As per the previous patch, this is required to prevent OOM killer deadlocks (and they don't actually get killed, because they're already exiting -- they're simply allowed access to memory reserves). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/4a3ede107e422a0c53d28024b0aa902ca22a8768;" Also: it will allow them to ""OOM-kill"" themselves if they are exiting";yes;yes;no 379;MDY6Q29tbWl0MjMyNTI5ODo0YTNlZGUxMDdlNDIyYTBjNTNkMjgwMjRiMGFhOTAyY2EyMmE4NzY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle oom_disable exiting Having the oomkilladj == OOM_DISABLE check before the releasing check means that oomkilladj == OOM_DISABLE tasks exiting will not stop the OOM killer. Moving the test down will give the desired behaviour. Also: it will allow them to ""OOM-kill"" themselves if they are exiting. As per the previous patch, this is required to prevent OOM killer deadlocks (and they don't actually get killed, because they're already exiting -- they're simply allowed access to memory reserves). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/4a3ede107e422a0c53d28024b0aa902ca22a8768;" As per the previous patch, this is required to prevent OOM killer deadlocks (and they don't actually get killed, because they're already exiting -- they're simply allowed access to memory reserves).";yes;yes;yes 380;MDY6Q29tbWl0MjMyNTI5ODo1MGVjM2JiZmZiZThhOTYzNDdjNTQ4MzJkNDgxMTBhNWJjOWU5ZmY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle current exiting If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/50ec3bbffbe8a96347c54832d48110a5bc9e9ff8;[PATCH] oom: handle current exiting;yes;yes;no 380;MDY6Q29tbWl0MjMyNTI5ODo1MGVjM2JiZmZiZThhOTYzNDdjNTQ4MzJkNDgxMTBhNWJjOWU5ZmY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle current exiting If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/50ec3bbffbe8a96347c54832d48110a5bc9e9ff8;"If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else";yes;yes;yes 380;MDY6Q29tbWl0MjMyNTI5ODo1MGVjM2JiZmZiZThhOTYzNDdjNTQ4MzJkNDgxMTBhNWJjOWU5ZmY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle current exiting If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/50ec3bbffbe8a96347c54832d48110a5bc9e9ff8;" Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves";yes;yes;yes 380;MDY6Q29tbWl0MjMyNTI5ODo1MGVjM2JiZmZiZThhOTYzNDdjNTQ4MzJkNDgxMTBhNWJjOWU5ZmY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle current exiting If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/50ec3bbffbe8a96347c54832d48110a5bc9e9ff8;" Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE";yes;no;yes 380;MDY6Q29tbWl0MjMyNTI5ODo1MGVjM2JiZmZiZThhOTYzNDdjNTQ4MzJkNDgxMTBhNWJjOWU5ZmY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle current exiting If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/50ec3bbffbe8a96347c54832d48110a5bc9e9ff8;"The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks";no;yes;yes 380;MDY6Q29tbWl0MjMyNTI5ODo1MGVjM2JiZmZiZThhOTYzNDdjNTQ4MzJkNDgxMTBhNWJjOWU5ZmY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle current exiting If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/50ec3bbffbe8a96347c54832d48110a5bc9e9ff8;In the case of killing a PF_EXITING task, don't make a lot of noise about it;yes;no;no 380;MDY6Q29tbWl0MjMyNTI5ODo1MGVjM2JiZmZiZThhOTYzNDdjNTQ4MzJkNDgxMTBhNWJjOWU5ZmY4;Nick Piggin;Linus Torvalds;"[PATCH] oom: handle current exiting If current *is* exiting, it should actually be allowed to access reserved memory rather than OOM kill something else. Can't do this via a straight check in page_alloc.c because that would allow multiple tasks to use up reserves. Instead cause current to OOM-kill itself which will mark it as TIF_MEMDIE. The current procedure of simply aborting the OOM-kill if a task is exiting can lead to OOM deadlocks. In the case of killing a PF_EXITING task, don't make a lot of noise about it. This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/50ec3bbffbe8a96347c54832d48110a5bc9e9ff8;"This becomes more important in future patches, where we can ""kill"" OOM_DISABLE tasks.";yes;yes;no 381;MDY6Q29tbWl0MjMyNTI5ODo3ODg3YTNkYTc1M2UxYmE4MjQ0NTU2Y2M5YTJiMzhjODE1YmZlMjU2;Nick Piggin;Linus Torvalds;"[PATCH] oom: cpuset hint cpuset_excl_nodes_overlap does not always indicate that killing a task will not free any memory we for us. For example, we may be asking for an allocation from _anywhere_ in the machine, or the task in question may be pinning memory that is outside its cpuset. Fix this by just causing cpuset_excl_nodes_overlap to reduce the badness rather than disallow it. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/7887a3da753e1ba8244556cc9a2b38c815bfe256;[PATCH] oom: cpuset hint;yes;no;no 381;MDY6Q29tbWl0MjMyNTI5ODo3ODg3YTNkYTc1M2UxYmE4MjQ0NTU2Y2M5YTJiMzhjODE1YmZlMjU2;Nick Piggin;Linus Torvalds;"[PATCH] oom: cpuset hint cpuset_excl_nodes_overlap does not always indicate that killing a task will not free any memory we for us. For example, we may be asking for an allocation from _anywhere_ in the machine, or the task in question may be pinning memory that is outside its cpuset. Fix this by just causing cpuset_excl_nodes_overlap to reduce the badness rather than disallow it. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/7887a3da753e1ba8244556cc9a2b38c815bfe256;"cpuset_excl_nodes_overlap does not always indicate that killing a task will not free any memory we for us";no;yes;yes 381;MDY6Q29tbWl0MjMyNTI5ODo3ODg3YTNkYTc1M2UxYmE4MjQ0NTU2Y2M5YTJiMzhjODE1YmZlMjU2;Nick Piggin;Linus Torvalds;"[PATCH] oom: cpuset hint cpuset_excl_nodes_overlap does not always indicate that killing a task will not free any memory we for us. For example, we may be asking for an allocation from _anywhere_ in the machine, or the task in question may be pinning memory that is outside its cpuset. Fix this by just causing cpuset_excl_nodes_overlap to reduce the badness rather than disallow it. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/7887a3da753e1ba8244556cc9a2b38c815bfe256;" For example, we may be asking for an allocation from _anywhere_ in the machine, or the task in question may be pinning memory that is outside its cpuset";no;no;yes 381;MDY6Q29tbWl0MjMyNTI5ODo3ODg3YTNkYTc1M2UxYmE4MjQ0NTU2Y2M5YTJiMzhjODE1YmZlMjU2;Nick Piggin;Linus Torvalds;"[PATCH] oom: cpuset hint cpuset_excl_nodes_overlap does not always indicate that killing a task will not free any memory we for us. For example, we may be asking for an allocation from _anywhere_ in the machine, or the task in question may be pinning memory that is outside its cpuset. Fix this by just causing cpuset_excl_nodes_overlap to reduce the badness rather than disallow it. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/7887a3da753e1ba8244556cc9a2b38c815bfe256;" Fix this by just causing cpuset_excl_nodes_overlap to reduce the badness rather than disallow it.";yes;yes;no 382;MDY6Q29tbWl0MjMyNTI5ODo4YmM3MTlkM2NhYjg0MTQ5MzhmOWVhNmUzM2I1OGQ4ODEwZDE4MDY4;Martin Schwidefsky;Linus Torvalds;"[PATCH] out of memory notifier Add a notifer chain to the out of memory killer. If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run. The purpose of the notifier is to add a safety net in the presence of memory ballooners. If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes. The implementation for the s390 ballooner is included. [akpm@osdl.org: cleanups] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8bc719d3cab8414938f9ea6e33b58d8810d18068;[PATCH] out of memory notifier;yes;no;no 382;MDY6Q29tbWl0MjMyNTI5ODo4YmM3MTlkM2NhYjg0MTQ5MzhmOWVhNmUzM2I1OGQ4ODEwZDE4MDY4;Martin Schwidefsky;Linus Torvalds;"[PATCH] out of memory notifier Add a notifer chain to the out of memory killer. If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run. The purpose of the notifier is to add a safety net in the presence of memory ballooners. If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes. The implementation for the s390 ballooner is included. [akpm@osdl.org: cleanups] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8bc719d3cab8414938f9ea6e33b58d8810d18068;Add a notifer chain to the out of memory killer;yes;no;no 382;MDY6Q29tbWl0MjMyNTI5ODo4YmM3MTlkM2NhYjg0MTQ5MzhmOWVhNmUzM2I1OGQ4ODEwZDE4MDY4;Martin Schwidefsky;Linus Torvalds;"[PATCH] out of memory notifier Add a notifer chain to the out of memory killer. If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run. The purpose of the notifier is to add a safety net in the presence of memory ballooners. If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes. The implementation for the s390 ballooner is included. [akpm@osdl.org: cleanups] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8bc719d3cab8414938f9ea6e33b58d8810d18068;" If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run";yes;no;no 382;MDY6Q29tbWl0MjMyNTI5ODo4YmM3MTlkM2NhYjg0MTQ5MzhmOWVhNmUzM2I1OGQ4ODEwZDE4MDY4;Martin Schwidefsky;Linus Torvalds;"[PATCH] out of memory notifier Add a notifer chain to the out of memory killer. If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run. The purpose of the notifier is to add a safety net in the presence of memory ballooners. If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes. The implementation for the s390 ballooner is included. [akpm@osdl.org: cleanups] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8bc719d3cab8414938f9ea6e33b58d8810d18068;"The purpose of the notifier is to add a safety net in the presence of memory ballooners";yes;yes;no 382;MDY6Q29tbWl0MjMyNTI5ODo4YmM3MTlkM2NhYjg0MTQ5MzhmOWVhNmUzM2I1OGQ4ODEwZDE4MDY4;Martin Schwidefsky;Linus Torvalds;"[PATCH] out of memory notifier Add a notifer chain to the out of memory killer. If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run. The purpose of the notifier is to add a safety net in the presence of memory ballooners. If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes. The implementation for the s390 ballooner is included. [akpm@osdl.org: cleanups] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8bc719d3cab8414938f9ea6e33b58d8810d18068;" If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes";yes;yes;yes 382;MDY6Q29tbWl0MjMyNTI5ODo4YmM3MTlkM2NhYjg0MTQ5MzhmOWVhNmUzM2I1OGQ4ODEwZDE4MDY4;Martin Schwidefsky;Linus Torvalds;"[PATCH] out of memory notifier Add a notifer chain to the out of memory killer. If one of the registered callbacks could release some memory, do not kill the process but return and retry the allocation that forced the oom killer to run. The purpose of the notifier is to add a safety net in the presence of memory ballooners. If the resource manager inflated the balloon to a size where memory allocations can not be satisfied anymore, it is better to deflate the balloon a bit instead of killing processes. The implementation for the s390 ballooner is included. [akpm@osdl.org: cleanups] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/8bc719d3cab8414938f9ea6e33b58d8810d18068;The implementation for the s390 ballooner is included.;yes;no;no 383;MDY6Q29tbWl0MjMyNTI5ODozNmM4YjU4Njg5NmY2MGNiOTFhNGZkNTI2MjMzMTkwYjM0MzE2YmFm;Ingo Molnar;Linus Torvalds;"[PATCH] sched: cleanup, remove task_t, convert to struct task_struct cleanup: remove task_t and convert all the uses to struct task_struct. I introduced it for the scheduler anno and it was a mistake. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/36c8b586896f60cb91a4fd526233190b34316baf;[PATCH] sched: cleanup, remove task_t, convert to struct task_struct;yes;yes;no 383;MDY6Q29tbWl0MjMyNTI5ODozNmM4YjU4Njg5NmY2MGNiOTFhNGZkNTI2MjMzMTkwYjM0MzE2YmFm;Ingo Molnar;Linus Torvalds;"[PATCH] sched: cleanup, remove task_t, convert to struct task_struct cleanup: remove task_t and convert all the uses to struct task_struct. I introduced it for the scheduler anno and it was a mistake. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/36c8b586896f60cb91a4fd526233190b34316baf;cleanup: remove task_t and convert all the uses to struct task_struct;yes;yes;no 383;MDY6Q29tbWl0MjMyNTI5ODozNmM4YjU4Njg5NmY2MGNiOTFhNGZkNTI2MjMzMTkwYjM0MzE2YmFm;Ingo Molnar;Linus Torvalds;"[PATCH] sched: cleanup, remove task_t, convert to struct task_struct cleanup: remove task_t and convert all the uses to struct task_struct. I introduced it for the scheduler anno and it was a mistake. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/36c8b586896f60cb91a4fd526233190b34316baf;"I introduced it for the scheduler anno and it was a mistake";no;yes;yes 383;MDY6Q29tbWl0MjMyNTI5ODozNmM4YjU4Njg5NmY2MGNiOTFhNGZkNTI2MjMzMTkwYjM0MzE2YmFm;Ingo Molnar;Linus Torvalds;"[PATCH] sched: cleanup, remove task_t, convert to struct task_struct cleanup: remove task_t and convert all the uses to struct task_struct. I introduced it for the scheduler anno and it was a mistake. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/36c8b586896f60cb91a4fd526233190b34316baf;"Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand.";yes;no;yes 384;MDY6Q29tbWl0MjMyNTI5ODo2OTM3YTI1Y2ZmODE4ZDMyZDBmOWZmNThhNTE4YzlhYjk2NzYwYWVi;Dave Peterson;Linus Torvalds;"[PATCH] mm: fix typos in comments in mm/oom_kill.c This fixes a few typos in the comments in mm/oom_kill.c. Signed-off-by: David S. Peterson <dsp@llnl.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/6937a25cff818d32d0f9ff58a518c9ab96760aeb;[PATCH] mm: fix typos in comments in mm/oom_kill.c;yes;yes;no 384;MDY6Q29tbWl0MjMyNTI5ODo2OTM3YTI1Y2ZmODE4ZDMyZDBmOWZmNThhNTE4YzlhYjk2NzYwYWVi;Dave Peterson;Linus Torvalds;"[PATCH] mm: fix typos in comments in mm/oom_kill.c This fixes a few typos in the comments in mm/oom_kill.c. Signed-off-by: David S. Peterson <dsp@llnl.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/6937a25cff818d32d0f9ff58a518c9ab96760aeb;This fixes a few typos in the comments in mm/oom_kill.c.;yes;yes;no 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;[PATCH] support for panic at OOM;yes;no;no 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;This patch adds panic_on_oom sysctl under sys.vm;yes;no;no 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;"When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes";yes;no;yes 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;" And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today";yes;no;yes 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;" Of course, the default value is 0 and only root can modifies it";yes;no;yes 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;In general, oom_killer works well and kill rogue processes;no;yes;yes 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;" So the whole system can survive";no;yes;yes 385;MDY6Q29tbWl0MjMyNTI5ODpmYWRkOGZiZDE1M2MxMjk2M2Y4ZmUzYzllZjdmODk2N2YyODZmOThi;KAMEZAWA Hiroyuki;Linus Torvalds;"[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/fadd8fbd153c12963f8fe3c9ef7f8967f286f98b;" But there are environments where panic is preferable rather than kill some processes.";yes;yes;yes 386;MDY6Q29tbWl0MjMyNTI5ODowMTMxNTkyMjdiODQwZGZkNDQxYmQyZTRjOGI0ZDc3ZmZiM2NjNDJl;Dave Peterson;Linus Torvalds;"[PATCH] mm: fix mm_struct reference counting bugs in mm/oom_kill.c Fix oom_kill_task() so it doesn't call mmput() (which may sleep) while holding tasklist_lock. Signed-off-by: David S. Peterson <dsp@llnl.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/013159227b840dfd441bd2e4c8b4d77ffb3cc42e;[PATCH] mm: fix mm_struct reference counting bugs in mm/oom_kill.c;yes;yes;no 386;MDY6Q29tbWl0MjMyNTI5ODowMTMxNTkyMjdiODQwZGZkNDQxYmQyZTRjOGI0ZDc3ZmZiM2NjNDJl;Dave Peterson;Linus Torvalds;"[PATCH] mm: fix mm_struct reference counting bugs in mm/oom_kill.c Fix oom_kill_task() so it doesn't call mmput() (which may sleep) while holding tasklist_lock. Signed-off-by: David S. Peterson <dsp@llnl.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/013159227b840dfd441bd2e4c8b4d77ffb3cc42e;"Fix oom_kill_task() so it doesn't call mmput() (which may sleep) while holding tasklist_lock.";yes;yes;yes 387;MDY6Q29tbWl0MjMyNTI5ODo5N2MyYzliODRkMGMxZWRmNDkyNmIxMzY2MWQ1YWYzZjBlZGNjYmNl;Andrew Morton;Linus Torvalds;"[PATCH] oom-kill: mm locking fix Dave Peterson <dsp@llnl.gov> points out that badness() is playing with mm_structs without taking a reference on them. mmput() can sleep, so taking a reference here (inside tasklist_lock) is hard. Fix it up via task_lock() instead. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/97c2c9b84d0c1edf4926b13661d5af3f0edccbce;[PATCH] oom-kill: mm locking fix;yes;yes;no 387;MDY6Q29tbWl0MjMyNTI5ODo5N2MyYzliODRkMGMxZWRmNDkyNmIxMzY2MWQ1YWYzZjBlZGNjYmNl;Andrew Morton;Linus Torvalds;"[PATCH] oom-kill: mm locking fix Dave Peterson <dsp@llnl.gov> points out that badness() is playing with mm_structs without taking a reference on them. mmput() can sleep, so taking a reference here (inside tasklist_lock) is hard. Fix it up via task_lock() instead. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/97c2c9b84d0c1edf4926b13661d5af3f0edccbce;"Dave Peterson <dsp@llnl.gov> points out that badness() is playing with mm_structs without taking a reference on them";no;no;yes 387;MDY6Q29tbWl0MjMyNTI5ODo5N2MyYzliODRkMGMxZWRmNDkyNmIxMzY2MWQ1YWYzZjBlZGNjYmNl;Andrew Morton;Linus Torvalds;"[PATCH] oom-kill: mm locking fix Dave Peterson <dsp@llnl.gov> points out that badness() is playing with mm_structs without taking a reference on them. mmput() can sleep, so taking a reference here (inside tasklist_lock) is hard. Fix it up via task_lock() instead. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/97c2c9b84d0c1edf4926b13661d5af3f0edccbce;"mmput() can sleep, so taking a reference here (inside tasklist_lock) is hard";no;yes;yes 387;MDY6Q29tbWl0MjMyNTI5ODo5N2MyYzliODRkMGMxZWRmNDkyNmIxMzY2MWQ1YWYzZjBlZGNjYmNl;Andrew Morton;Linus Torvalds;"[PATCH] oom-kill: mm locking fix Dave Peterson <dsp@llnl.gov> points out that badness() is playing with mm_structs without taking a reference on them. mmput() can sleep, so taking a reference here (inside tasklist_lock) is hard. Fix it up via task_lock() instead. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/97c2c9b84d0c1edf4926b13661d5af3f0edccbce; Fix it up via task_lock() instead.;yes;yes;no 388;MDY6Q29tbWl0MjMyNTI5ODoxNDBmZmNlYzRkZWYzZWUzYWY3NTY1YjJjZjFkM2IyNTgwZjdlMTgw;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory() locking fix I seem to have lost this read_unlock(). While we're there, let's turn that interruptible sleep unto uninterruptible, so we don't get a busywait if signal_pending(). (Again. We seem to have a habit of doing this). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/140ffcec4def3ee3af7565b2cf1d3b2580f7e180;[PATCH] out_of_memory() locking fix;yes;yes;no 388;MDY6Q29tbWl0MjMyNTI5ODoxNDBmZmNlYzRkZWYzZWUzYWY3NTY1YjJjZjFkM2IyNTgwZjdlMTgw;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory() locking fix I seem to have lost this read_unlock(). While we're there, let's turn that interruptible sleep unto uninterruptible, so we don't get a busywait if signal_pending(). (Again. We seem to have a habit of doing this). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/140ffcec4def3ee3af7565b2cf1d3b2580f7e180;I seem to have lost this read_unlock();yes;yes;yes 388;MDY6Q29tbWl0MjMyNTI5ODoxNDBmZmNlYzRkZWYzZWUzYWY3NTY1YjJjZjFkM2IyNTgwZjdlMTgw;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory() locking fix I seem to have lost this read_unlock(). While we're there, let's turn that interruptible sleep unto uninterruptible, so we don't get a busywait if signal_pending(). (Again. We seem to have a habit of doing this). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/140ffcec4def3ee3af7565b2cf1d3b2580f7e180;"While we're there, let's turn that interruptible sleep unto uninterruptible, so we don't get a busywait if signal_pending()";yes;yes;no 388;MDY6Q29tbWl0MjMyNTI5ODoxNDBmZmNlYzRkZWYzZWUzYWY3NTY1YjJjZjFkM2IyNTgwZjdlMTgw;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory() locking fix I seem to have lost this read_unlock(). While we're there, let's turn that interruptible sleep unto uninterruptible, so we don't get a busywait if signal_pending(). (Again. We seem to have a habit of doing this). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/140ffcec4def3ee3af7565b2cf1d3b2580f7e180; (Again;no;no;yes 388;MDY6Q29tbWl0MjMyNTI5ODoxNDBmZmNlYzRkZWYzZWUzYWY3NTY1YjJjZjFkM2IyNTgwZjdlMTgw;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory() locking fix I seem to have lost this read_unlock(). While we're there, let's turn that interruptible sleep unto uninterruptible, so we don't get a busywait if signal_pending(). (Again. We seem to have a habit of doing this). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/140ffcec4def3ee3af7565b2cf1d3b2580f7e180;" We seem to have a habit of doing this).";no;no;yes 389;MDY6Q29tbWl0MjMyNTI5ODpkNjcxM2UwNDYzMzZmZmE5ODA2MDQxOGM0ZDJjNjUyNDM2MzllMTA3;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory(): use of uninitialised Under some circumstances `points' can get printed before it's initialised. Spotted by Carlos Martin <carlos@cmartin.tk>. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/d6713e046336ffa98060418c4d2c65243639e107;[PATCH] out_of_memory(): use of uninitialised;no;yes;no 389;MDY6Q29tbWl0MjMyNTI5ODpkNjcxM2UwNDYzMzZmZmE5ODA2MDQxOGM0ZDJjNjUyNDM2MzllMTA3;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory(): use of uninitialised Under some circumstances `points' can get printed before it's initialised. Spotted by Carlos Martin <carlos@cmartin.tk>. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/d6713e046336ffa98060418c4d2c65243639e107;Under some circumstances `points' can get printed before it's initialised;no;yes;yes 389;MDY6Q29tbWl0MjMyNTI5ODpkNjcxM2UwNDYzMzZmZmE5ODA2MDQxOGM0ZDJjNjUyNDM2MzllMTA3;Andrew Morton;Linus Torvalds;"[PATCH] out_of_memory(): use of uninitialised Under some circumstances `points' can get printed before it's initialised. Spotted by Carlos Martin <carlos@cmartin.tk>. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/d6713e046336ffa98060418c4d2c65243639e107;Spotted by Carlos Martin <carlos@cmartin.tk>.;no;no;yes 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;[PATCH] Terminate process that fails on a constrained allocation;yes;no;no 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;"Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints)";no;no;yes 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;" If the page allocator is not able to find enough memory then that does not mean that overall system memory is low";no;no;yes 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;"In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down)";no;yes;yes 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;"It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior";yes;yes;no 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;" The process may be killed but then the sysadmin or developer can investigate the situation";yes;no;no 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;" The solution is similar to what we do when running out of hugepages";yes;yes;no 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;This patch adds a check before we kill processes;yes;yes;no 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;" At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes";yes;yes;yes 390;MDY6Q29tbWl0MjMyNTI5ODo5YjBmOGIwNDBhY2Q4ZGZkMjM4NjA3NTRjMGQwOWZmNGY0NGUyY2Jj;Christoph Lameter;Linus Torvalds;"[PATCH] Terminate process that fails on a constrained allocation Some allocations are restricted to a limited set of nodes (due to memory policies or cpuset constraints). If the page allocator is not able to find enough memory then that does not mean that overall system memory is low. In particular going postal and more or less randomly shooting at processes is not likely going to help the situation but may just lead to suicide (the whole system coming down). It is better to signal to the process that no memory exists given the constraints that the process (or the configuration of the process) has placed on the allocation behavior. The process may be killed but then the sysadmin or developer can investigate the situation. The solution is similar to what we do when running out of hugepages. This patch adds a check before we kill processes. At that point performance considerations do not matter much so we just scan the zonelist and reconstruct a list of nodes. If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9b0f8b040acd8dfd23860754c0d09ff4f44e2cbc;" If the list of nodes does not contain all online nodes then this is a constrained allocation and we should kill the current process.";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;[PATCH] OOM kill: children accounting;yes;no;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;In the badness() calculation, there's currently this piece of code;no;no;yes 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child";no;yes;yes 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature";no;yes;yes 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family";no;yes;yes 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;" This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children";no;yes;yes 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process()";no;no;yes 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"In the patch I account only half (rounded up) of the children's vm_size to the parent";yes;no;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;" This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;" Or -- if people would consider it worth the trouble -- make it a sysctl";yes;no;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;" For now I sticked to accounting for half, which should IMHO be a significant improvement";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;Description;yes;no;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;"Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs";yes;yes;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;" If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected";yes;no;no 391;MDY6Q29tbWl0MjMyNTI5ODo5ODI3Yjc4MWYyMDgyOGU1Y2ViOTExYjg3OWYyNjhmNzhmZTkwODE1;Kurt Garloff;Linus Torvalds;"[PATCH] OOM kill: children accounting In the badness() calculation, there's currently this piece of code: /* * Processes which fork a lot of child processes are likely * a good choice. We add the vmsize of the children if they * have an own mm. This prevents forking servers to flood the * machine with an endless amount of children */ list_for_each(tsk, &p->children) { struct task_struct *chld; chld = list_entry(tsk, struct task_struct, sibling); if (chld->mm = p->mm && chld->mm) points += chld->mm->total_vm; } The intention is clear: If some server (apache) keeps spawning new children and we run OOM, we want to kill the father rather than picking a child. This -- to some degree -- also helps a bit with getting fork bombs under control, though I'd consider this a desirable side-effect rather than a feature. There's one problem with this: No matter how many or few children there are, if just one of them misbehaves, and all others (including the father) do everything right, we still always kill the whole family. This hits in real life; whether it's javascript in konqueror resulting in kdeinit (and thus the whole KDE session) being hit or just a classical server that spawns children. Sidenote: The killer does kill all direct children as well, not only the selected father, see oom_kill_process(). The idea in attached patch is that we do want to account the memory consumption of the (direct) children to the father -- however not fully. This maintains the property that fathers with too many children will still very likely be picked, whereas a single misbehaving child has the chance to be picked by the OOM killer. In the patch I account only half (rounded up) of the children's vm_size to the parent. This means that if one child eats more mem than the rest of the family, it will be picked, otherwise it's still the father and thus the whole family that gets selected. This is heuristics -- we could debate whether accounting for a fourth would be better than for half of it. Or -- if people would consider it worth the trouble -- make it a sysctl. For now I sticked to accounting for half, which should IMHO be a significant improvement. The patch does one more thing: As users tend to be irritated by the choice of killed processes (mainly because the children are killed first, despite some of them having a very low OOM score), I added some more output: The selected (father) process will be reported first and it's oom_score printed to syslog. Description: Only account for half of children's vm size in oom score calculation This should still give the parent enough point in case of fork bombs. If any child however has more than 50% of the vm size of all children together, it'll get a higher score and be elected. This patch also makes the kernel display the oom_score. Signed-off-by: Kurt Garloff <garloff@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/9827b781f20828e5ceb911b879f268f78fe90815;This patch also makes the kernel display the oom_score.;yes;yes;no 392;MDY6Q29tbWl0MjMyNTI5ODpiOTU4ZjdkOWYzNWJmYjYxNjI1ZjIwMWNkOTJhM2ZjMzk1MDRhZjdh;Andrew Morton;Linus Torvalds;"[PATCH] dump_stack() in oom handler Sometimes it's nice to know who's calling. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b958f7d9f35bfb61625f201cd92a3fc39504af7a;[PATCH] dump_stack() in oom handler;yes;no;no 392;MDY6Q29tbWl0MjMyNTI5ODpiOTU4ZjdkOWYzNWJmYjYxNjI1ZjIwMWNkOTJhM2ZjMzk1MDRhZjdh;Andrew Morton;Linus Torvalds;"[PATCH] dump_stack() in oom handler Sometimes it's nice to know who's calling. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/b958f7d9f35bfb61625f201cd92a3fc39504af7a;Sometimes it's nice to know who's calling.;no;yes;no 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;[PATCH] cpuset oom lock fix;yes;yes;no 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;The problem, reported in;no;yes;yes 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;"and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock)";no;yes;yes 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;One must not sleep while holding a spinlock;no;yes;yes 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;"The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region";yes;yes;no 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;This required a few lines of mechanism to implement;yes;no;no 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;" The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only";yes;yes;yes 393;MDY6Q29tbWl0MjMyNTI5ODo1MDU5NzBiOTZlM2I3ZDIyMTc3YzM4ZTAzNDM1YTY4Mzc2NjI4ZTdh;Paul Jackson;Linus Torvalds;"[PATCH] cpuset oom lock fix The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/505970b96e3b7d22177c38e03435a68376628e7a;" So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem).";yes;yes;no 394;MDY6Q29tbWl0MjMyNTI5ODoyZjY1OWY0NjJkMmFiNTE5MDY4ZDBlMmJiNjc3ZDdhNzAwZGVjYjhk;Kirill Korotaev;Linus Torvalds;"[PATCH] Optimise oom kill of current task When oom_killer kills current there's no need to call schedule_timeout_interruptible() since task must die ASAP. Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/2f659f462d2ab519068d0e2bb677d7a700decb8d;[PATCH] Optimise oom kill of current task;yes;yes;no 394;MDY6Q29tbWl0MjMyNTI5ODoyZjY1OWY0NjJkMmFiNTE5MDY4ZDBlMmJiNjc3ZDdhNzAwZGVjYjhk;Kirill Korotaev;Linus Torvalds;"[PATCH] Optimise oom kill of current task When oom_killer kills current there's no need to call schedule_timeout_interruptible() since task must die ASAP. Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/2f659f462d2ab519068d0e2bb677d7a700decb8d;"When oom_killer kills current there's no need to call schedule_timeout_interruptible() since task must die ASAP.";yes;yes;yes 395;MDY6Q29tbWl0MjMyNTI5ODpkZDBmYzY2ZmIzM2NkNjEwYmMxYTVkYjhhNWUyMzJkMzQ4NzliNGQ3;Al Viro;Linus Torvalds;"[PATCH] gfp flags annotations - part 1 - added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd0fc66fb33cd610bc1a5db8a5e232d34879b4d7;[PATCH] gfp flags annotations - part 1;yes;no;no 395;MDY6Q29tbWl0MjMyNTI5ODpkZDBmYzY2ZmIzM2NkNjEwYmMxYTVkYjhhNWUyMzJkMzQ4NzliNGQ3;Al Viro;Linus Torvalds;"[PATCH] gfp flags annotations - part 1 - added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/dd0fc66fb33cd610bc1a5db8a5e232d34879b4d7;" - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better.";yes;yes;no 396;MDY6Q29tbWl0MjMyNTI5ODoxM2U0YjU3ZjZhNGUyM2NlYjk5Nzk0YTY1MGQ3NzdlNzQ4MzFmNGE2;Nishanth Aravamudan;Linus Torvalds;"[PATCH] mm: fix-up schedule_timeout() usage Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/13e4b57f6a4e23ceb99794a650d777e74831f4a6;[PATCH] mm: fix-up schedule_timeout() usage;yes;yes;no 396;MDY6Q29tbWl0MjMyNTI5ODoxM2U0YjU3ZjZhNGUyM2NlYjk5Nzk0YTY1MGQ3NzdlNzQ4MzFmNGE2;Nishanth Aravamudan;Linus Torvalds;"[PATCH] mm: fix-up schedule_timeout() usage Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/13e4b57f6a4e23ceb99794a650d777e74831f4a6;set_current_state()/schedule_timeout() to reduce kernel size.;yes;yes;no 397;MDY6Q29tbWl0MjMyNTI5ODplZjA4ZTNiNDk4MWFlYmYyYmE5YmQ3MDI1ZWY3MjEwZThlZWMwN2Nl;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce;[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset;yes;no;no 397;MDY6Q29tbWl0MjMyNTI5ODplZjA4ZTNiNDk4MWFlYmYyYmE5YmQ3MDI1ZWY3MjEwZThlZWMwN2Nl;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce;"Now the real motivation for this cpuset mem_exclusive patch series seems trivial";no;yes;no 397;MDY6Q29tbWl0MjMyNTI5ODplZjA4ZTNiNDk4MWFlYmYyYmE5YmQ3MDI1ZWY3MjEwZThlZWMwN2Nl;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce;"This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset";yes;yes;no 397;MDY6Q29tbWl0MjMyNTI5ODplZjA4ZTNiNDk4MWFlYmYyYmE5YmQ3MDI1ZWY3MjEwZThlZWMwN2Nl;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce;" Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes";no;yes;yes 397;MDY6Q29tbWl0MjMyNTI5ODplZjA4ZTNiNDk4MWFlYmYyYmE5YmQ3MDI1ZWY3MjEwZThlZWMwN2Nl;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce;"This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time.";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;[PATCH] cpusets: oom_kill tweaks;yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that";yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy";yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;Here's an example usage scenario;no;no;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half";no;no;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib";no;no;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it";yes;no;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets";no;no;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present";no;yes;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot";no;yes;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset";yes;yes;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes";yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time";no;no;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64)";yes;no;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;There are 4 patches in this set;yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca; 2) Add another GFP flag - __GFP_HARDWALL;yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" It marks memory requests for USER space, which are tightly confined by the current tasks cpuset";yes;yes;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset";yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours";yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag)";yes;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;This patch;no;no;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;"This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement";yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;The comment changed in oom_kill.c was seriously misleading;yes;yes;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;" The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition)";yes;yes;yes 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;Also a couple typos and spellos that bugged me, while I was here;yes;yes;no 398;MDY6Q29tbWl0MjMyNTI5ODphNDkzMzVjY2VhYjhhZmI2NjAzMTUyZmNjM2Y3ZDNiNjY3NzM2NmNh;Paul Jackson;Linus Torvalds;"[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/a49335cceab8afb6603152fcc3f7d3b6677366ca;This patch should have no material affect.;yes;no;no 399;MDY6Q29tbWl0MjMyNTI5ODo0MjYzOTI2OWY5Y2U0YWFjMmU2YzIwYmNiY2EzMGI1ZGE4YjlhODk5;Anton Blanchard;Linus Torvalds;"[PATCH] mm: quieten OOM killer noise We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed. For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory). If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console. Use printk_ratelimit() to quieten it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/42639269f9ce4aac2e6c20bcbca30b5da8b9a899;[PATCH] mm: quieten OOM killer noise;yes;yes;no 399;MDY6Q29tbWl0MjMyNTI5ODo0MjYzOTI2OWY5Y2U0YWFjMmU2YzIwYmNiY2EzMGI1ZGE4YjlhODk5;Anton Blanchard;Linus Torvalds;"[PATCH] mm: quieten OOM killer noise We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed. For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory). If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console. Use printk_ratelimit() to quieten it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/42639269f9ce4aac2e6c20bcbca30b5da8b9a899;"We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed";no;yes;yes 399;MDY6Q29tbWl0MjMyNTI5ODo0MjYzOTI2OWY5Y2U0YWFjMmU2YzIwYmNiY2EzMGI1ZGE4YjlhODk5;Anton Blanchard;Linus Torvalds;"[PATCH] mm: quieten OOM killer noise We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed. For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory). If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console. Use printk_ratelimit() to quieten it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/42639269f9ce4aac2e6c20bcbca30b5da8b9a899;"For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory)";no;no;yes 399;MDY6Q29tbWl0MjMyNTI5ODo0MjYzOTI2OWY5Y2U0YWFjMmU2YzIwYmNiY2EzMGI1ZGE4YjlhODk5;Anton Blanchard;Linus Torvalds;"[PATCH] mm: quieten OOM killer noise We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed. For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory). If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console. Use printk_ratelimit() to quieten it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/42639269f9ce4aac2e6c20bcbca30b5da8b9a899;" If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console";no;yes;yes 399;MDY6Q29tbWl0MjMyNTI5ODo0MjYzOTI2OWY5Y2U0YWFjMmU2YzIwYmNiY2EzMGI1ZGE4YjlhODk5;Anton Blanchard;Linus Torvalds;"[PATCH] mm: quieten OOM killer noise We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed. For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory). If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console. Use printk_ratelimit() to quieten it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/42639269f9ce4aac2e6c20bcbca30b5da8b9a899;Use printk_ratelimit() to quieten it.;yes;yes;no 400;MDY6Q29tbWl0MjMyNTI5ODo3OWI5Y2UzMTFlMTkyZTlhMzFmZDlmM2NmMWVlNGE0ZWRmOWUyNjUw;Marcelo Tosatti;Linus Torvalds;"[PATCH] print order information when OOM killing Dump the current allocation order when OOM killing. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/79b9ce311e192e9a31fd9f3cf1ee4a4edf9e2650;[PATCH] print order information when OOM killing;yes;yes;no 400;MDY6Q29tbWl0MjMyNTI5ODo3OWI5Y2UzMTFlMTkyZTlhMzFmZDlmM2NmMWVlNGE0ZWRmOWUyNjUw;Marcelo Tosatti;Linus Torvalds;"[PATCH] print order information when OOM killing Dump the current allocation order when OOM killing. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/79b9ce311e192e9a31fd9f3cf1ee4a4edf9e2650;Dump the current allocation order when OOM killing.;yes;no;no 401;MDY6Q29tbWl0MjMyNTI5ODo1NzhjMmZkNmE3ZjM3ODQzNDY1NWU1YzQ4MGUyMzE1MmEzOTk0NDA0;Janet Morgan;Linus Torvalds;"[PATCH] add OOM debug This patch provides more debug info when the system is OOM. It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill. Thanks to Dave Jones for coming up with the idea. Signed-off-by: Janet Morgan <janetmor@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/578c2fd6a7f378434655e5c480e23152a3994404;[PATCH] add OOM debug;yes;yes;no 401;MDY6Q29tbWl0MjMyNTI5ODo1NzhjMmZkNmE3ZjM3ODQzNDY1NWU1YzQ4MGUyMzE1MmEzOTk0NDA0;Janet Morgan;Linus Torvalds;"[PATCH] add OOM debug This patch provides more debug info when the system is OOM. It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill. Thanks to Dave Jones for coming up with the idea. Signed-off-by: Janet Morgan <janetmor@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/578c2fd6a7f378434655e5c480e23152a3994404;This patch provides more debug info when the system is OOM;yes;yes;no 401;MDY6Q29tbWl0MjMyNTI5ODo1NzhjMmZkNmE3ZjM3ODQzNDY1NWU1YzQ4MGUyMzE1MmEzOTk0NDA0;Janet Morgan;Linus Torvalds;"[PATCH] add OOM debug This patch provides more debug info when the system is OOM. It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill. Thanks to Dave Jones for coming up with the idea. Signed-off-by: Janet Morgan <janetmor@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/578c2fd6a7f378434655e5c480e23152a3994404;" It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill";yes;yes;no 401;MDY6Q29tbWl0MjMyNTI5ODo1NzhjMmZkNmE3ZjM3ODQzNDY1NWU1YzQ4MGUyMzE1MmEzOTk0NDA0;Janet Morgan;Linus Torvalds;"[PATCH] add OOM debug This patch provides more debug info when the system is OOM. It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill. Thanks to Dave Jones for coming up with the idea. Signed-off-by: Janet Morgan <janetmor@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/578c2fd6a7f378434655e5c480e23152a3994404;Thanks to Dave Jones for coming up with the idea.;no;no;no 402;MDY6Q29tbWl0MjMyNTI5ODo3OWJlZmQwYzA4YzQ3NjZmOGZhMjdlMzdhYzJhNzBlNDA4NDBhNTZh;Andrea Arcangeli;Linus Torvalds;"[PATCH] oom-killer disable for iscsi/lvm2/multipath userland critical sections iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so make the magical value of -17 in /proc/<pid>/oom_adj defeat the oom-killer altogether. (akpm: we still need to document oom_adj and friends in Documentation/filesystems/proc.txt!) Signed-off-by: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/79befd0c08c4766f8fa27e37ac2a70e40840a56a;[PATCH] oom-killer disable for iscsi/lvm2/multipath userland critical sections;yes;no;no 402;MDY6Q29tbWl0MjMyNTI5ODo3OWJlZmQwYzA4YzQ3NjZmOGZhMjdlMzdhYzJhNzBlNDA4NDBhNTZh;Andrea Arcangeli;Linus Torvalds;"[PATCH] oom-killer disable for iscsi/lvm2/multipath userland critical sections iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so make the magical value of -17 in /proc/<pid>/oom_adj defeat the oom-killer altogether. (akpm: we still need to document oom_adj and friends in Documentation/filesystems/proc.txt!) Signed-off-by: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/79befd0c08c4766f8fa27e37ac2a70e40840a56a;"iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so make the magical value of -17 in /proc/<pid>/oom_adj defeat the oom-killer altogether";yes;yes;no 402;MDY6Q29tbWl0MjMyNTI5ODo3OWJlZmQwYzA4YzQ3NjZmOGZhMjdlMzdhYzJhNzBlNDA4NDBhNTZh;Andrea Arcangeli;Linus Torvalds;"[PATCH] oom-killer disable for iscsi/lvm2/multipath userland critical sections iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so make the magical value of -17 in /proc/<pid>/oom_adj defeat the oom-killer altogether. (akpm: we still need to document oom_adj and friends in Documentation/filesystems/proc.txt!) Signed-off-by: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>";https://api.github.com/repos/torvalds/linux/git/commits/79befd0c08c4766f8fa27e37ac2a70e40840a56a;"(akpm: we still need to document oom_adj and friends in Documentation/filesystems/proc.txt!)";yes;no;no 403;MDY6Q29tbWl0MjMyNTI5ODoxZGExNzdlNGMzZjQxNTI0ZTg4NmI3ZjFiOGEwYzFmYzczMjFjYWMy;Linus Torvalds;Linus Torvalds;"Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate ""historical"" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!";https://api.github.com/repos/torvalds/linux/git/commits/1da177e4c3f41524e886b7f1b8a0c1fc7321cac2;Initial git repository build;yes;yes;no 403;MDY6Q29tbWl0MjMyNTI5ODoxZGExNzdlNGMzZjQxNTI0ZTg4NmI3ZjFiOGEwYzFmYzczMjFjYWMy;Linus Torvalds;Linus Torvalds;"Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate ""historical"" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!";https://api.github.com/repos/torvalds/linux/git/commits/1da177e4c3f41524e886b7f1b8a0c1fc7321cac2;"I'm not bothering with the full history, even though we have it";yes;no;no 403;MDY6Q29tbWl0MjMyNTI5ODoxZGExNzdlNGMzZjQxNTI0ZTg4NmI3ZjFiOGEwYzFmYzczMjFjYWMy;Linus Torvalds;Linus Torvalds;"Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate ""historical"" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!";https://api.github.com/repos/torvalds/linux/git/commits/1da177e4c3f41524e886b7f1b8a0c1fc7321cac2;"We can create a separate ""historical"" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it";yes;yes;no 403;MDY6Q29tbWl0MjMyNTI5ODoxZGExNzdlNGMzZjQxNTI0ZTg4NmI3ZjFiOGEwYzFmYzczMjFjYWMy;Linus Torvalds;Linus Torvalds;"Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate ""historical"" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!";https://api.github.com/repos/torvalds/linux/git/commits/1da177e4c3f41524e886b7f1b8a0c1fc7321cac2;Let it rip!;yes;no;no