1. 21 Feb, 2016 40 commits
    • ARM: Do 15e0d9e3 (ARM: pm: let platforms select cpu_suspend support) properly · 9bd8ce11
      Russell King authored
      
      commit b6c7aabd923a17af993c5a5d5d7995f0b27c000a upstream.
      
      Let's do the changes properly and fix the same problem everywhere, not
      just for one case.
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9bd8ce11
    • ext4: convert number of blocks to clusters properly · 5f05c0a0
      Lukas Czerner authored
      
      commit 810da240f221d64bf90020f25941b05b378186fe upstream.
      
      We're using macro EXT4_B2C() to convert number of blocks to number of
      clusters for bigalloc file systems.  However, we should be using
      EXT4_NUM_B2C().
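
      For reference, the two macros (as defined in fs/ext4/ext4.h) differ in that
      one translates a single block number while the other rounds up a count of
      blocks:

      /* Translate a block number to a cluster number */
      #define EXT4_B2C(sbi, blk)      ((blk) >> (sbi)->s_cluster_bits)
      /* Translate # of blks to # of clusters */
      #define EXT4_NUM_B2C(sbi, blks) (((blks) + (sbi)->s_cluster_ratio - 1) >> \
                                       (sbi)->s_cluster_bits)
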
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: CAI Qian <caiqian@redhat.com>
      Signed-off-by: Lingzhu Xiang <lxiang@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5f05c0a0
    • android: drivers: Fix build broken by android debugfs fix. · abdc7b3d
      Riley Andrews authored
      Fix build broken by commit b0bd81a67ae5ced88b ("android: drivers: workaround
      debugfs race in binder").
      
      Change-Id: I10c5c0211144a4a1c270dc03f92cf6a1a829e8f8
      abdc7b3d
    • ARM: footbridge: fix VGA initialisation · 94ed6132
      Russell King authored
      commit 43659222e7a0113912ed02f6b2231550b3e471ac upstream.
      
      It's no good setting vga_base after the VGA console has been
      initialised, because if we do that we get this:
      
      Unable to handle kernel paging request at virtual address 000b8000
      pgd = c0004000
      [000b8000] *pgd=07ffc831, *pte=00000000, *ppte=00000000
      0Internal error: Oops: 5017 [#1] ARM
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.0+ #49
      task: c03e2974 ti: c03d8000 task.ti: c03d8000
      PC is at vgacon_startup+0x258/0x39c
      LR is at request_resource+0x10/0x1c
      pc : [<c01725d0>]    lr : [<c0022b50>]    psr: 60000053
      sp : c03d9f68  ip : 000b8000  fp : c03d9f8c
      r10: 000055aa  r9 : 4401a103  r8 : ffffaa55
      r7 : c03e357c  r6 : c051b460  r5 : 000000ff  r4 : 000c0000
      r3 : 000b8000  r2 : c03e0514  r1 : 00000000  r0 : c0304971
      Flags: nZCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  Segment kernel
      
      which is an access to the 0xb8000 without the PCI offset required to
      make it work.
      
      Fixes: cc22b4c1 ("ARM: set vga memory base at run-time")
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Cc: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      94ed6132
    • ARM: 7742/1: topology: export cpu_topology · 71ada24d
      Arnd Bergmann authored
      
      commit 92bdd3f5eba299b33c2f4407977d6fa2e2a6a0da upstream.
      
      The cpu_topology symbol is required by any driver using the topology
      interfaces, which leads to a couple of build errors:
      
      ERROR: "cpu_topology" [drivers/net/ethernet/sfc/sfc.ko] undefined!
      ERROR: "cpu_topology" [drivers/cpufreq/arm_big_little.ko] undefined!
      ERROR: "cpu_topology" [drivers/block/mtip32xx/mtip32xx.ko] undefined!
      
      The obvious solution is to export this symbol.
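
      The change itself is essentially a one-line export next to the cpu_topology
      definition in arch/arm/kernel/topology.c, along the lines of:

      EXPORT_SYMBOL_GPL(cpu_topology);
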
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Cc: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      71ada24d
    • ARM: fix "bad mode in ... handler" message for undefined instructions · b2a6a5a9
      Russell King authored
      
      commit 29c350bf28da333e41e30497b649fe335712a2ab upstream.
      
      The array was missing the final entry for the undefined instruction
      exception handler; this commit adds it.
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2a6a5a9
    • mmc: core: Fix clock frequency transitions during invalid states · d68e6610
      Sujit Reddy Thumma authored
      
      eMMC and SD card specifications restrict the usage of a class of
      commands while commands in other class are in progress. For example,
      during erase operations the SD/eMMC spec. allows only CMD35, CMD36,
      CMD38. If clock scaling is enabled and decides to scale up the clocks,
      it is possible that CMD19/CMD21 tuning commands are sent in between
      erase commands, which is illegal as per the specification.
      
      Fix such illegal transactions to the card and also make clock scaling
      statistics accountable only for read/write commands instead of time
      consuming commands, like CMD38 erase, where transactions are independent
      of bus frequency.
      
      Change-Id: Iffba175787837e7f95bde8970f19d0f0f9d7d67d
      Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
      d68e6610
    • tmpfs: fix use-after-free of mempolicy object · c87fe6ad
      Greg Thelen authored
      
      commit 5f00110f7273f9ff04ac69a5f85bb535a4fd0987 upstream.
      
      The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
      option is not specified in the remount request.  A new policy can be
      specified if mpol=M is given.
      
      Before this patch remounting an mpol bound tmpfs without specifying
      mpol= mount option in the remount request would set the filesystem's
      mempolicy object to a freed mempolicy object.
      
      To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
          # mkdir /tmp/x
      
          # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
      
          # mount -o remount,size=200M nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
              # note ? garbage in mpol=... output above
      
          # dd if=/dev/zero of=/tmp/x/f count=1
              # panic here
      
      Panic:
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          [...]
          Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
          Call Trace:
            mpol_shared_policy_init+0xa5/0x160
            shmem_get_inode+0x209/0x270
            shmem_mknod+0x3e/0xf0
            shmem_create+0x18/0x20
            vfs_create+0xb5/0x130
            do_last+0x9a1/0xea0
            path_openat+0xb3/0x4d0
            do_filp_open+0x42/0xa0
            do_sys_open+0xfe/0x1e0
            compat_sys_open+0x1b/0x20
            cstar_dispatch+0x7/0x1f
      
      Non-debug kernels will not crash immediately because referencing the
      dangling mpol will not cause a fault.  Instead the filesystem will
      reference a freed mempolicy object, which will cause unpredictable
      behavior.
      
      The problem boils down to a dropped mpol reference below if
      shmem_parse_options() does not allocate a new mpol:
      
          config = *sbinfo
          shmem_parse_options(data, &config, true)
          mpol_put(sbinfo->mpol)
          sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */
      
      This patch avoids the crash by not releasing the mempolicy if
      shmem_parse_options() doesn't create a new mpol.
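
      A simplified sketch of the fixed remount path, mirroring the pseudocode
      above:

      config.mpol = NULL;                    /* stays NULL unless mpol= is parsed */
      if (shmem_parse_options(data, &config, true))
              return error;
      /* size/inode limit updates happen here as before */
      if (config.mpol) {                     /* a new mpol was supplied on remount */
              mpol_put(sbinfo->mpol);        /* drop the old reference */
              sbinfo->mpol = config.mpol;    /* takes over the initial reference */
      }                                      /* otherwise leave sbinfo->mpol untouched */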
      
      How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
      not look back further.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c87fe6ad
    • mm: fix pageblock bitmap allocation · da587821
      Linus Torvalds authored
      
      commit 7c45512df987c5619db041b5c9b80d281e26d3db upstream.
      
      Commit c060f943d092 ("mm: use aligned zone start for pfn_to_bitidx
      calculation") fixed our calculation of the index into the pageblock
      bitmap when a !SPARSEMEM zone was not aligned to pageblock_nr_pages.
      
      However, the _allocation_ of that bitmap had never taken this alignment
      requirement into account, so depending on the exact size and alignment of
      the zone, the use of that index could then access past the allocation,
      resulting in some very subtle memory corruption.
      
      This was reported (and bisected) by Ingo Molnar: one of his random
      config builds would hang with certain very specific kernel command line
      options.
      
      In the meantime, commit c060f943d092 has been marked for stable, so this
      fix needs to be back-ported to the stable kernels that backported the
      commit to use the right alignment.
      Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      da587821
    • tmpfs: fix shared mempolicy leak · ffa815dc
      Mel Gorman authored
      
      commit 18a2f371f5edf41810f6469cb9be39931ef9deb9 upstream.
      
      This fixes a regression in 3.7-rc, which has since gone into stable.
      
      Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
      imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
      refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
      on expecting alloc_page_vma() to drop the refcount it had acquired.
      This deserves a rework: but for now fix the leak in shmem_alloc_page().
      
      Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
      the same refcounting there as in shmem_alloc_page(), delete its onstack
      mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
      those were invented to let swapin_readahead() make an unknown number of
      calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
      alloc_pages_vma() has kept refcount in balance, so now no problem.
      Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ffa815dc
    • tmpfs mempolicy: fix /proc/mounts corrupting memory · c4a3839f
      Hugh Dickins authored
      commit f2a07f40dbc603c15f8b06e6ec7f768af67b424f upstream.
      
      Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
      mempolicy testing.  Very nasty.  Reading /proc/mounts, /proc/pid/mounts
      or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
      in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
      pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
      worse.  "mpol=prefer" and "mpol=prefer:Node" are equally toxic.
      
      Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
      when commit e17f74af "mempolicy: don't call mpol_set_nodemask() when
      no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
      which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
      With slab poisoning, you can then rely on mpol_to_str() to set the bit
      for node 0x6b6b, probably in the next page above the caller's stack.
      
      mpol_parse_str() is only called from shmem_parse_options(): no_context
      is always true, so call it unused for now, and remove !no_context code.
      Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
      expect.  Then mpol_to_str() can ignore its no_context argument also,
      the mpol being appropriately initialized whether contextualized or not.
      Rename its no_context unused too, and let subsequent patch remove them
      (that's not needed for stable backporting, which would involve rejects).
      
      I don't understand why MPOL_LOCAL is described as a pseudo-policy:
      it's a reasonable policy which suffers from a confusing implementation
      in terms of MPOL_PREFERRED with MPOL_F_LOCAL.  I believe this would be
      much more robust if MPOL_LOCAL were recognized in switch statements
      throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
      empty) nodes mask like everyone else, instead of its preferred_node
      variant (I presume an optimization from the days before MPOL_LOCAL).
      But that would take me too long to get right and fully tested.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4a3839f
    • memcg: oom: fix totalpages calculation for memory.swappiness==0 · d632f7e2
      Michal Hocko authored
      
      commit 9a5a8f19b43430752067ecaee62fc59e11e88fa6 upstream.
      
      oom_badness() takes a totalpages argument which says how many pages are
      available and it uses it as a base for the score calculation.  The value
      is calculated by mem_cgroup_get_limit which considers both limit and
      total_swap_pages (resp.  memsw portion of it).
      
      This is usually correct but since fe35004fbf9e ("mm: avoid swapping out
      with swappiness==0") we do not swap when swappiness is 0 which means
      that we cannot really use up all the totalpages pages.  This in turn
      confuses oom score calculation if the memcg limit is much smaller than
      the available swap because the used memory (capped by the limit) is
      negligible compared to totalpages, so the resulting score is too small
      if adj!=0 (typically a task with CAP_SYS_ADMIN or a non-zero oom_score_adj).
      A wrong process might be selected as a result.
      
      The problem can be worked around by checking mem_cgroup_swappiness==0
      and not considering swap at all in such a case.
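
      A simplified sketch of that check in mem_cgroup_get_limit():

      limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
      if (mem_cgroup_swappiness(memcg)) {
              /* swap is actually usable, so the memsw-based headroom is real */
              limit += total_swap_pages << PAGE_SHIFT;
              limit = min(limit, res_counter_read_u64(&memcg->memsw, RES_LIMIT));
      }
      return limit;
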
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d632f7e2
    • mempolicy: fix a race in shared_policy_replace() · b0dc0842
      Mel Gorman authored
      
      commit b22d127a39ddd10d93deee3d96e643657ad53a49 upstream.
      
      shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
      be dereferenced if sp->lock is not held and 2) another thread can modify
      sp_node between spin_unlock for allocating a new sp node and next
      spin_lock.  The bug was introduced before 2.6.12-rc2.
      
      Kosaki's original patch for this problem was to allocate an sp node and
      policy within shared_policy_replace and initialise it when the lock is
      reacquired.  I was not keen on this approach because it partially
      duplicates sp_alloc().  As the paths where sp->lock is taken are not that
      performance critical, this patch converts sp->lock to sp->mutex so it can
      sleep when calling sp_alloc().
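
      In other words, the change boils down to making the lock in struct
      shared_policy sleepable, roughly:

      struct shared_policy {
              struct rb_root root;
              struct mutex mutex;     /* was: spinlock_t lock */
      };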
      
      [kosaki.motohiro@jp.fujitsu.com: Original patch]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b0dc0842
    • mm: Add notifier framework for showing memory · 42844307
      Laura Abbott authored
      
      There are many drivers in the kernel which can hold on
      to lots of memory. It can be useful to dump out all those
      drivers at key points in the kernel. Introduce a notifier
      framework for dumping this information. When the notifiers
      are called, drivers can dump out the state of any memory
      they may be using.
      
      Change-Id: I514ef1d01510a50970a661c8e9bedc8b78683eab
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      42844307
    • mm: vmpressure: allow in-kernel clients to subscribe for events · 2570b7de
      Vinayak Menon authored
      
      Currently, vmpressure is tied to memcg and its events are
      available only to userspace clients. This patch removes
      the dependency on CONFIG_MEMCG and adds a mechanism for
      in-kernel clients to subscribe for vmpressure events (in
      fact raw vmpressure values are delivered instead of vmpressure
      levels, to provide clients more flexibility to take actions
      on custom pressure levels which are not currently defined
      by vmpressure module).
      
      Change-Id: I1500c098cde11010e463d67955e8a03feb193a67
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      2570b7de
    • mm, oom: make dump_tasks public · 1e9e190a
      Liam Mark authored
      
      Allow other functions to dump the list of tasks.
      Useful when debugging memory leaks.
      
      Bug: 17871993
      Change-Id: I0d9e812d242cbd9e152d561be9a16c00bad3c032
      Signed-off-by: Liam Mark <lmark@codeaurora.org>
      Signed-off-by: Naveen Ramaraj <nramaraj@codeaurora.org>
      1e9e190a
    • memcg: add memory.pressure_level events · d12c78e5
      Anton Vorontsov authored
      With this patch userland applications that want to maintain the
      interactivity/memory allocation cost can use the pressure level
      notifications.  The levels are defined like this:
      
      The "low" level means that the system is reclaiming memory for new
      allocations.  Monitoring this reclaiming activity might be useful for
      maintaining cache level.  Upon notification, the program (typically
      "Activity Manager") might analyze vmstat and act in advance (i.e.
      prematurely shutdown unimportant services).
      
      The "medium" level means that the system is experiencing medium memory
      pressure, the system might be making swap, paging out active file
      caches, etc.  Upon this event applications may decide to further analyze
      vmstat/zoneinfo/memcg or internal memory usage statistics and free any
      resources that can be easily reconstructed or re-read from a disk.
      
      The "critical" level means that the system is actively thrashing, it is
      about to run out of memory (OOM) or even the in-kernel OOM killer is on its
      way to trigger.  Applications should do whatever they can to help the
      system.  It might be too late to consult with vmstat or any other
      statistics, so it's advisable to take an immediate action.
      
      The events are propagated upward until the event is handled, i.e.  the
      events are not pass-through.  Here is what this means: for example you
      have three cgroups: A->B->C.  Now you set up an event listener on
      cgroups A, B and C, and suppose group C experiences some pressure.  In
      this situation, only group C will receive the notification, i.e.  groups
      A and B will not receive it.  This is done to avoid excessive
      "broadcasting" of messages, which disturbs the system and which is
      especially bad if we are low on memory or thrashing.  So, organize the
      cgroups wisely, or propagate the events manually (or, ask us to
      implement the pass-through events, explaining why would you need them.)
      
      Performance wise, the memory pressure notifications feature itself is
      lightweight and does not require much of bookkeeping, in contrast to the
      rest of memcg features.  Unfortunately, as of current memcg
      implementation, pages accounting is an inseparable part and cannot be
      turned off.  The good news is that there are some efforts[1] to improve
      the situation; plus, implementing the same, fully API-compatible[2]
      interface for CONFIG_CGROUP_MEM_RES_CTLR=n case (e.g.  embedded) is also
      a viable option, so it will not require any changes on the userland
      side.
      
      [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
      [2] http://lkml.org/lkml/2013/2/21/454
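
      For illustration, a minimal userspace listener could look like the sketch
      below (error handling omitted; it assumes the memory controller is mounted
      at /sys/fs/cgroup/memory and uses the cgroup.event_control registration
      described in Documentation/cgroups/memory.txt):

      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/eventfd.h>
      #include <unistd.h>

      int main(void)
      {
              int efd = eventfd(0, 0);
              int lfd = open("/sys/fs/cgroup/memory/memory.pressure_level", O_RDONLY);
              int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);
              char line[64];
              uint64_t cnt;

              /* "<event_fd> <pressure_level_fd> <level>" arms the notification */
              snprintf(line, sizeof(line), "%d %d medium", efd, lfd);
              write(cfd, line, strlen(line));

              for (;;) {
                      read(efd, &cnt, sizeof(cnt));   /* blocks until an event fires */
                      printf("medium memory pressure (%llu events)\n",
                             (unsigned long long)cnt);
              }
      }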
      
      
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Change-Id: I4e703d3688c74466e02cf0f2b866e85043fe799d
      d12c78e5
    • slab: fix the DEADLOCK issue on l3 alien lock · 788a9073
      Michael Wang authored
      commit 947ca1856a7e60aa6d20536785e6a42dff25aa6e upstream.
      
      A DEADLOCK will be reported while running a kernel with NUMA and LOCKDEP
      enabled. The process behind this fake report is:
      
      	   kmem_cache_free()	//free obj in cachep
      	-> cache_free_alien()	//acquire cachep's l3 alien lock
      	-> __drain_alien_cache()
      	-> free_block()
      	-> slab_destroy()
      	-> kmem_cache_free()	//free slab in cachep->slabp_cache
      	-> cache_free_alien()	//acquire cachep->slabp_cache's l3 alien lock
      
      Since the cachep and cachep->slabp_cache's l3 alien are in the same lock class,
      a fake report is generated.
      
      This should not happen since we already have init_lock_keys() which will
      reassign the lock class for both l3 list and l3 alien.
      
      However, init_lock_keys() was invoked at the wrong position: before we
      invoke enable_cpucache() on each cache.

      Until slab_state is set to FULL, enable_cpucache() is not invoked on
      caches while creating them, so their l3 alien locks are not built yet.
      Although init_lock_keys() was invoked, the l3 alien lock classes could not
      change, since those locks do not exist until enable_cpucache() is invoked
      later.

      This patch invokes init_lock_keys() after enable_cpucache() is done,
      instead of before, to avoid the fake DEADLOCK report.
      
      Michael traced the problem back to a commit in release 3.0.0:
      
      commit 30765b92
      Author: Peter Zijlstra <peterz@infradead.org>
      Date:   Thu Jul 28 23:22:56 2011 +0200
      
          slab, lockdep: Annotate the locks before using them
      
          Fernando found we hit the regular OFF_SLAB 'recursion' before we
          annotate the locks, cure this.
      
          The relevant portion of the stack-trace:
      
          > [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
          > [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
          > [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
          > [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
          > [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
          > [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
          > [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
          > [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
          > [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
          > [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf
      Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
          Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
      
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      
      The commit moved init_lock_keys() before we build up the alien, so we
      failed to reclass it.
      Acked-by: Christoph Lameter <cl@linux.com>
      Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      788a9073
    • arm: arch_timer: set memory mapped timer interrupt as IRQF_TIMER · 8808c5cc
      Mahesh Sivasubramanian authored
      
      The memory mapped timer is used as a broadcast timer to wake the core for
      timer interrupts when the arch timer might not be functional. When interrupt
      is not marked as IRQF_NO_SUSPEND, the interrupt gets disabled during the
      suspend_device_irqs() callback in the suspend path. If a core were to enter an
      idle low-power mode which relies on the broadcast timer to process the interrupt,
      the core would never be woken up for timer interrupts.

      Mark the interrupt with IRQF_TIMER, which flags it as a timer interrupt
      and also implies IRQF_NO_SUSPEND.
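
      A sketch of what the request then looks like (the handler and device names
      below are illustrative, not the exact driver code):

      ret = request_irq(irq, mem_timer_handler,
                        IRQF_TIMER | IRQF_TRIGGER_HIGH,
                        "arch_mem_timer", evt);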
      
      CRs-fixed: 636712
      Change-Id: I0484e92a9d05f66a0c5b3c00c584a3dd3fe6ae85
      Signed-off-by: Mahesh Sivasubramanian <msivasub@codeaurora.org>
      8808c5cc
    • ARM: Allow panic on division by zero in the kernel · aaa03166
      Syed Rameez Mustafa authored
      
      Division by zero errors in the kernel currently trigger warnings.
      Allow panic on these errors so that we can catch the problem closer
      to its source.
      
      Change-Id: Id5fed71b74cd37874ae857a8105455d7561c782d
      Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
      aaa03166
    • memory hotplug: fix invalid memory access caused by stale kswapd pointer · e7554260
      Jiang Liu authored
      
      commit d8adde17e5f858427504725218c56aef90e90fc7 upstream.
      
      kswapd_stop() is called to destroy the kswapd work thread when all memory
      of a NUMA node has been offlined.  But kswapd_stop() only terminates the
      work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
      pointer will prevent kswapd_run() from creating a new work thread when
      adding memory to the memory-less NUMA node again.  Eventually the stale
      pointer may cause invalid memory access.
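
      A sketch of the shape of the fix: clear the per-node pointer when the
      thread is stopped so a later kswapd_run() can start a fresh one.

      void kswapd_stop(int nid)
      {
              struct task_struct *kswapd = NODE_DATA(nid)->kswapd;

              if (kswapd) {
                      kthread_stop(kswapd);
                      NODE_DATA(nid)->kswapd = NULL;  /* don't leave a stale pointer behind */
              }
      }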
      
      An example stack dump is below. It was reproduced with 2.6.32, but the
      latest kernel has the same issue.
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
        PGD 0
        Oops: 0000 [#1] SMP
        last sysfs file: /sys/devices/system/memory/memory391/state
        CPU 11
        Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
        RIP: 0010:exit_creds+0x12/0x78
        RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
        RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
        RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
        R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
        R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
        FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
        Stack:
         ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
         ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
         0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
        Call Trace:
          __put_task_struct+0x5d/0x97
          kthread_stop+0x50/0x58
          offline_pages+0x324/0x3da
          memory_block_change_state+0x179/0x1db
          store_mem_state+0x9e/0xbb
          sysfs_write_file+0xd0/0x107
          vfs_write+0xad/0x169
          sys_write+0x45/0x6e
          system_call_fastpath+0x16/0x1b
        Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
        RIP  exit_creds+0x12/0x78
         RSP <ffff8806044f1d78>
        CR2: 0000000000000000
      
      [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e7554260
    • mm/memblock: fix memory leak on extending regions · defc6575
      Gavin Shan authored
      
      commit 181eb39425f2b9275afcb015eaa547d11f71a02f upstream.
      
      The overall memblock has been organized into the memory regions and
      reserved regions.  Initially, the memory regions and reserved regions are
      stored in the predetermined arrays of "struct memblock_region".  It's
      possible for the arrays to be enlarged when we have newly added regions,
      but no free space left there.  The policy here is to create double-sized
      array either by the slab allocator or the memblock allocator.  Unfortunately, we
      didn't free the old array, which might have been allocated through the slab
      allocator before.  That would cause a memory leak.
      
      The patch introduces 2 variables to track where (slab or memblock) the
      memory and reserved region arrays come from.  The memory for the memory or
      reserved regions will be deallocated by kfree() if it was allocated by the
      slab allocator, thus fixing the memory leak issue.
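
      The freeing side of memblock_double_array() then becomes, roughly
      (simplified sketch, not the verbatim patch):

      if (*in_slab)
              kfree(old_array);       /* old array came from the slab: this was the leak */
      else if (old_array != memblock_memory_init_regions &&
               old_array != memblock_reserved_init_regions)
              memblock_free(__pa(old_array), old_alloc_size);

      *in_slab = use_slab;            /* remember how the new array was allocated */
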
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      defc6575
    • mm/memblock: fix overlapping allocation when doubling reserved array · ef674d27
      Greg Pearson authored
      
      commit 48c3b583bbddad2220ca4c22319ca5d1f78b2090 upstream.
      
      __alloc_memory_core_early() asks memblock for a range of memory then tries
      to reserve it.  If the reserved region array lacks space for the new
      range, memblock_double_array() is called to allocate more space for the
      array.  If memblock is used to allocate memory for the new array it can
      end up using a range that overlaps with the range originally allocated in
      __alloc_memory_core_early(), leading to possible data corruption.
      
      With this patch memblock_double_array() now calls memblock_find_in_range()
      with a narrowed candidate range (in cases where the reserved.regions array
      is being doubled) so any memory allocated will not overlap with the
      original range that was being reserved.  The range is narrowed by passing
      in the starting address and size of the previously allocated range.  Then
      the range above the ending address is searched and if a candidate is not
      found, the range below the starting address is searched.
      Signed-off-by: Greg Pearson <greg.pearson@hp.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ef674d27
    • mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · 34967266
      Mel Gorman authored
      commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream.
      
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably introduced
      in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
      look like this:
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately it appears to be also insufficient. Consider the following
      situation
      
      Process A					Process B
      ---------					---------
      hugetlb_fault					shmdt
        						LockWrite(mmap_sem)
          						  do_munmap
      						    unmap_region
      						      unmap_vmas
      						        unmap_single_vma
      						          unmap_hugepage_range
            						            Lock(i_mmap_mutex)
      							    Lock(mm->page_table_lock)
      							    huge_pmd_unshare/unmap tables <--- (1)
      							    Unlock(mm->page_table_lock)
            						            Unlock(i_mmap_mutex)
        huge_pte_alloc				      ...
          Lock(i_mmap_mutex)				      ...
          vma_prio_walk, find svma, spte		      ...
          Lock(mm->page_table_lock)			      ...
          share spte					      ...
          Unlock(mm->page_table_lock)			      ...
          Unlock(i_mmap_mutex)			      ...
        hugetlb_no_page									  <--- (2)
      						      free_pgtables
      						        unlink_file_vma
      							hugetlb_free_pgd_range
      						    remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vulnerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
      ==== CUT HERE ====
      
      static size_t huge_page_size = (2UL << 20);
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could itterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork:");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      34967266
    • mm: mmu_notifier: fix freed page still mapped in secondary MMU · 6d94102f
      Xiao Guangrong authored
      
      commit 3ad3d901bbcfb15a5e4690e55350db0899095a68 upstream.
      
      mmu_notifier_release() is called when the process is exiting.  It will
      delete all the mmu notifiers.  But at this time the page belonging to the
      process is still present in page tables and is present on the LRU list, so
      this race will happen:
      
            CPU 0                 CPU 1
      mmu_notifier_release:    try_to_unmap:
         hlist_del_init_rcu(&mn->hlist);
                                  ptep_clear_flush_notify:
                                        mmu notifier not found
                                  free page  !!!!!!
                                  /*
                                   * At the point, the page has been
                                   * freed, but it is still mapped in
                                   * the secondary MMU.
                                   */
      
        mn->ops->release(mn, mm);
      
      Then the box is not stable and sometimes we can get this bug:
      
      [  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
      [  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
      [  738.075936] page flags: 0x20000000000014(referenced|dirty)
      
      The same issue is present in mmu_notifier_unregister().
      
      We can call ->release before deleting the notifier to ensure the page has
      been unmapped from the secondary MMU before it is freed.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d94102f
    • mm: fix crashes from mbind() merging vmas · f14412af
      Hugh Dickins authored
      commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.
      
      In v2.6.34 commit 9d8cebd4 ("mm: fix mbind vma merge problem")
      introduced vma merging to mbind(), but it should have also changed the
      convention of passing start vma from queue_pages_range() (formerly
      check_range()) to new_vma_page(): vma merging may have already freed
      that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
      worse crashes.
      
      Fixes: 9d8cebd4 ("mm: fix mbind vma merge problem")
      Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f14412af
    • power: qpnp-bms: always limit soc to [0, 100] · 124e06c1
      Xiaozhe Shi authored
      
      Currently, soc is only limited to [0, 100] when adjust_soc runs.
      However, there are some cases where the main algorithm of adjust_soc is
      skipped, due to charging, SOC being too high or in the flat region of
      the PC/OCV curve.
      
      This can cause issues where SOC is calculated to be over 100 or under
      0, which is undesirable. Fix this by moving the bound_soc call to the
      main calculate_soc function so that it is never skipped.
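
      The clamp itself is the trivial bound_soc() helper (sketch; the point of
      the patch is where it is called from, not its body):

      static int bound_soc(int soc)
      {
              soc = max(0, soc);
              soc = min(100, soc);
              return soc;
      }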
      
      CRs-Fixed: 697713
      Change-Id: I641f513d182c62731a4fc115f29c0e38e5ec4c14
      Signed-off-by: Xiaozhe Shi <xiaozhes@codeaurora.org>
      124e06c1
    • mmc: sdhci-msm: Fix clock gating while voltage switch is in progress · 35afe159
      Sujit Reddy Thumma authored
      
      CLK_PWRSAVE bit in vendor specific register gates the output clock to
      card automatically if there are no data/cmd operations.
      
      According to the SD3.0 voltage switch sequence, the host should provide
      a clock to the card for at least one millisecond before the DAT[3:0] lines
      are pulled high by the card. In this case, if the power save bit is enabled,
      it might auto-gate the clock even before the card completes the voltage
      switch sequence.

      Fix this by disabling the power save operation when the clocks are turned
      off and enabling it only when the clock rate is >400KHz, i.e., at the end
      of initialization.
      
      CRs-Fixed: 589992
      Change-Id: If82d6d2e303b8d1189b76712e514f41fe6e2cf8b
      Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
      35afe159
    • mmc: core: get drive types supported by eMMC cards · 8a08a0e6
      Krishna Konda authored
      
      Get the various drive types other than the default supported
      by the card.
      
      Change-Id: I122971e4fb4a3ab98f0078ceafca3380e9c0e2d1
      Signed-off-by: Krishna Konda <kkonda@codeaurora.org>
      8a08a0e6
    • mmc: core: Fix power class config for HS400 · c4ea5bef
      Venkat Gopalakrishnan authored
      
      Use the correct power class field from the extended CSD register
      for HS400 mode as defined in the eMMC5.0 specification.
      
      CRs-fixed: 690341
      Change-Id: Ie10e35941fd3c6ee49c686f721bf5af6fcd74862
      Signed-off-by: Venkat Gopalakrishnan <venkatg@codeaurora.org>
      c4ea5bef
    • v4l2: vb2: replace VIDEO_MAX_FRAME with VB2_MAX_FRAME · 5145166e
      Balamurugan Alagarsamy authored
      
      - vb2 drivers now rely on VB2_MAX_FRAME.

      - VB2_MAX_FRAME bumps the value to 64 from the current 32.
      
      Change-Id: I3d7998898df43553486166c44b54524aac449deb
      Signed-off-by: Balamurugan Alagarsamy <balaga@codeaurora.org>
      5145166e
    • cpaccess: remove use of set_get_l2_indirect_reg() · f92d2bb2
      Matt Wagantall authored
      
      cpaccess is the only client of set_get_l2_indirect_reg(), a special-
      purpose API for writing an indirect Krait CP15 register and reading
      back the new value. For simplicity, this API is being removed and
      replaced with back-to-back calls to set_l2_indirect_reg() and
      get_l2_indirect_reg(), which should perform comparably.
      
      Change-Id: I868ae115265c59e58ffd37dc405a91c6962e0c3d
      Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
      f92d2bb2
    • swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion · 4a9432dc
      Rafael Aquini authored
      
      commit cbab0e4eec299e9059199ebe6daf48730be46d2b upstream.
      
      read_swap_cache_async() can race against get_swap_page(), and stumble
      across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
      into the swapcache yet.
      
      This transient swap_map state is expected to be transitory, but the
      actual placement of discard at scan_swap_map() inserts a wait for I/O
      completion, thus making the thread at read_swap_cache_async() loop
      around its -EEXIST case, while the other end at get_swap_page() is
      scheduled away at scan_swap_map().  This can leave the system deadlocked
      if the I/O completion happens to be waiting on the CPU waitqueue where
      read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
      
      This patch introduces a cond_resched() call to make the aforementioned
      read_swap_cache_async() busy loop condition to bail out when necessary,
      thus avoiding the subtle race window.
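
      The change in the read_swap_cache_async() retry loop is essentially
      (simplified):

      err = swapcache_prepare(entry);
      if (err == -EEXIST) {
              /*
               * The entry is marked SWAP_HAS_CACHE but the page has not been
               * added to the swap cache yet; yield instead of busy looping.
               */
              cond_resched();
              continue;
      }
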
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a9432dc
    • Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys · a10059bc
      Mathieu Desnoyers authored
      
      commit 8aec0f5d4137532de14e6554fd5dd201ff3a3c49 upstream.
      
      Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
      compat_process_vm_rw() shows that the compatibility code requires an
      explicit "access_ok()" check before calling
      compat_rw_copy_check_uvector(). The same difference seems to appear when
      we compare fs/read_write.c:do_readv_writev() to
      fs/compat.c:compat_do_readv_writev().
      
      This subtle difference between the compat and non-compat requirements
      should probably be debated, as it seems to be error-prone. In fact,
      there are two other sites that use this function in the Linux kernel,
      and they both seem to get it wrong:
      
      Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
      also ends up calling compat_rw_copy_check_uvector() through
      aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
      be missing. Same situation for
      security/keys/compat.c:compat_keyctl_instantiate_key_iov().
      
      I propose that we add the access_ok() check directly into
      compat_rw_copy_check_uvector(), so callers don't have to worry about it,
      and it therefore makes the compat call code similar to its non-compat
      counterpart. Place the access_ok() check in the same location where
      copy_from_user() can trigger a -EFAULT error in the non-compat code, so
      the ABI behaviors are alike on both compat and non-compat.
      
      While we are here, fix compat_do_readv_writev() so it checks for
      compat_rw_copy_check_uvector() negative return values.
      
      And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
      handling.
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a10059bc
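      A minimal sketch of where the check is placed, per the description in the
      entry above (variable names follow the usual uvector code and are otherwise
      assumptions):
      
        /* fs/compat.c: compat_rw_copy_check_uvector(), sketch */
        ret = -EFAULT;
        if (!access_ok(VERIFY_READ, uvector, nr_segs * sizeof(*uvector)))
                goto out;       /* same -EFAULT the non-compat copy_from_user() yields */
        
        /* the existing per-segment __get_user() loop follows unchanged */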
    • Michal Hocko's avatar
      mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT · 368b15bb
      Michal Hocko authored
      commit 53a59fc67f97374758e63a9c785891ec62324c81 upstream.
      
      Since commit e303297e ("mm: extended batches for generic mmu_gather")
      we are batching pages to be freed until either tlb_next_batch cannot
      allocate a new batch or we are done.
      
      This works just fine most of the time, but we can get into trouble
      with a non-preemptible kernel (CONFIG_PREEMPT_NONE or
      CONFIG_PREEMPT_VOLUNTARY) on large machines, where too-aggressive
      batching may lead to soft lockups during the process exit path
      (exit_mmap): there are no scheduling points down the
      free_pages_and_swap_cache path, so the freeing can take long enough
      to trigger the soft lockup.
      
      The lockup is harmless except when the system is set up to panic on
      soft lockup, which is not that unusual.
      
      The simplest way to work around this issue is to limit the maximum
      number of batches in a single mmu_gather.  10k collected pages should
      be safe enough to prevent soft lockups (that leaves roughly 2ms per
      page before the soft-lockup threshold) even if they are all freed
      without an explicit scheduling point.
      
      This patch doesn't add any new explicit scheduling points; it relies
      on zap_pmd_range, which calls cond_resched once per PMD while zapping
      page tables.
      
      The following lockup has been reported for a 3.0 kernel with a huge
      process (on the order of hundreds of gigs, but I don't know any more
      details).
      
        BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
        Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
        Supported: Yes
        CPU 56
        Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
        RIP: 0010:  _raw_spin_unlock_irqrestore+0x8/0x10
        RSP: 0018:ffff883ec1037af0  EFLAGS: 00000206
        RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
        RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
        RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
        R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
        R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
        FS:  00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
        Call Trace:
          release_pages+0xc5/0x260
          free_pages_and_swap_cache+0x9d/0xc0
          tlb_flush_mmu+0x5c/0x80
          tlb_finish_mmu+0xe/0x50
          exit_mmap+0xbd/0x120
          mmput+0x49/0x120
          exit_mm+0x122/0x160
          do_exit+0x17a/0x430
          do_group_exit+0x3d/0xb0
          get_signal_to_deliver+0x247/0x480
          do_signal+0x71/0x1b0
          do_notify_resume+0x98/0xb0
          int_signal+0x12/0x17
        DWARF2 unwinder stuck at int_signal+0x12/0x17
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      368b15bb
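      A minimal sketch of the batch cap described in the entry above; the macro
      and field names follow the commit's wording, but treat the details
      (location, allocation flags) as an approximation rather than the exact
      upstream change:
      
        /* cap collected pages at ~10k before forcing a tlb_flush_mmu() */
        #define MAX_GATHER_BATCH_COUNT  (10000UL / MAX_GATHER_BATCH)
        
        static int tlb_next_batch(struct mmu_gather *tlb)
        {
                struct mmu_gather_batch *batch;
        
                if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
                        return 0;       /* stop growing; flush and free instead */
        
                batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
                if (!batch)
                        return 0;
        
                tlb->batch_count++;
                /* link the new batch into the gather as the existing code does */
                return 1;
        }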
    • Hugh Dickins's avatar
      tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking · 6cd2d508
      Hugh Dickins authored
      
      commit 35c2a7f4908d404c9124c2efc6ada4640ca4d5d5 upstream.
      
      Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
      	u64 inum = fid->raw[2];
      which is unhelpfully reported as being at the end of shmem_alloc_inode():
      
      BUG: unable to handle kernel paging request at ffff880061cd3000
      IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Call Trace:
       [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
       [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
       [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
       [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
      
      Right, tmpfs is being stupid to access fid->raw[2] before validating that
      fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
      fall at the end of a page, and the next page not be present.
      
      But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
      careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
      could oops in the same way: add the missing fh_len checks to those.
      Reported-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6cd2d508
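      A minimal sketch of the kind of check added, using the tmpfs case from the
      oops above (the body is abbreviated; the other filesystems get equivalent
      fh_len guards in their fh_to_dentry()/fh_to_parent()):
      
        static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
                                                 struct fid *fid, int fh_len,
                                                 int fh_type)
        {
                u64 inum;
        
                /* refuse handles too short to contain raw[2] before reading it */
                if (fh_len < 3)
                        return NULL;
        
                inum = fid->raw[2];
                inum = (inum << 32) | fid->raw[1];
        
                /* ... inode lookup by inum continues as before ... */
                return NULL;    /* sketch only; real code returns the dentry */
        }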
    • Vinson Lee's avatar
      perf tools: Fix build with bison 2.3 and older. · 51de4318
      Vinson Lee authored
      
      commit 85df3b3769222894e9692b383c7af124b7721086 upstream.
      
      The %name-prefix "prefix" syntax is not available in bison 2.3 and
      older.  Substitute the -p "prefix" command-line option for
      compatibility with older versions of bison.
      
      This patch fixes the following build error seen with such versions:
      
          CC util/sysfs.o
          BISON util/pmu-bison.c
      util/pmu.y:2.14-24: syntax error, unexpected string, expecting =
      make: *** [util/pmu-bison.c] Error 1
      Signed-off-by: default avatarVinson Lee <vlee@twitter.com>
      Tested-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/1360792138-29186-1-git-send-email-vlee@twitter.com
      
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      51de4318
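      A minimal sketch of the substitution described in the entry above; the
      perf_pmu_ prefix and the Makefile rule shape are assumptions inferred from
      the build error shown:
      
        # util/pmu.y: drop the directive that bison 2.3 cannot parse
        #   %name-prefix "perf_pmu_"
        
        # tools/perf/Makefile: pass the same prefix on the command line instead
        $(OUTPUT)util/pmu-bison.c: util/pmu.y
        	$(QUIET_BISON)$(BISON) -v util/pmu.y -d -p perf_pmu_ -o $@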
    • Nikola Majkić's avatar
      Revert "msm: kgsl: don't use sscanf()" · bd8d594b
      Nikola Majkić authored
      This reverts commit 3f9fee8a.
      bd8d594b
    • Nikola Majkić's avatar
      Revert "msm: kgsl: Keep track of kernel space mappings to memory" · ebd7b7a7
      Nikola Majkić authored
      This reverts commit afc733137089159e86e3bcd4fb2fb5aab66726ce.
      ebd7b7a7
    • Nikola Majkić's avatar