1. 21 Feb, 2016 40 commits
    • ion: Skip zeroing on secure buffers · e03742da
      Laura Abbott authored
      
      Secure buffers are never passed to userspace and are controlled
      by the secure world, so there is no real need to zero them. Pass the
      DMA attribute to skip zeroing.
      
      Change-Id: Iad870d0d7732d3dea09443418b9294cb9e05b5e0
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
    • msm: krait-regulator-pmic: account for PWM_CL register security · 1ea28bfb
      Abhijeet Dharmapurikar authored
      
      The PWM_CL register is secure, i.e. the driver has to write 0xA5 to the
      0xD0 register to allow updates to the PWM_CL register.
      
      Fix it.
      
      Change-Id: I3f273627bdc137d8c10768c7d5824abe96ee8707
      Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
    • msm: camera: make sure num_buf is not out of bound · c9554c7b
      Hariram Purushothaman authored
      
      V4L2 only allows 32 buffers. Check num_buf to make sure that the
      value passed from user space is not out of bounds.
      
      CRs-Fixed: 514698
      Change-Id: I662ec1eb998ed8bfb2a7f188e645410aa78c83b0
      Signed-off-by: Hariram Purushothaman <hpurus@codeaurora.org>
      Signed-off-by: Ankit Premrajka <ankitp@codeaurora.org>
      Signed-off-by: Raghu DP <dp.raghu@codeaurora.org>
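A minimal sketch of the kind of bounds check the num_buf commit describes. The driver code itself is not shown in this log, so `check_num_buf` is a hypothetical helper and `MAX_BUFFERS` simply mirrors the 32-buffer V4L2 limit quoted in the message.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_BUFFERS 32u /* assumption: mirrors the 32-buffer limit from the message */

/* Hypothetical helper: returns 0 if the user-supplied count is usable,
 * -EINVAL if it exceeds the limit. */
static int check_num_buf(uint32_t num_buf)
{
    if (num_buf > MAX_BUFFERS)
        return -EINVAL;
    return 0;
}
```

Rejecting the value before it is used as a loop bound or array index is what keeps a bad user-space argument from reaching past the buffer table.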
    • msm: vidc: Check maximum width capability of Q6 · 5c01ea47
      Ravi Kiran Vonteddu authored
      
      A maximum width capability check is required to avoid
      undefined behavior when a clip whose width exceeds
      the Q6 capability is played. Also, make the
      capability check generic for Q6 and Venus.
      
      CRs-fixed: 626642
      Change-Id: Ic10be0ad4434019fea45e7a090b21ba5cf54d9a6
      Signed-off-by: Ravi Kiran Vonteddu <rvontedd@codeaurora.org>
    • input: touchscreen: change the focaltech firmware upgrade method · 330ccb34
      Sarada Prasanna Garnayak authored
      
      Upgrade firmware on the touch controller when the new firmware
      version is greater than the current firmware version. Update the
      version id after a successful firmware update. Skip the firmware
      update process when the device is in the suspend state.
      
      CRs-Fixed: 623803
      Change-Id: Ic462f6483887a3654665852e58ae9891de9f5eff
      Signed-off-by: Sarada Prasanna Garnayak <c_sgarna@codeaurora.org>
    • devfreq: governor_cpubw_hwmon: Remove sample_ms and fix round up of freq · 75c250d2
      Saravana Kannan authored
      
      devfreq already provides a sysfs interface for changing polling/sampling
      period. So, there is no need for the governor to separately expose
      sampling_ms. Also, sampling_ms had to be explicitly updated whenever
      polling_ms was updated for the governor to function correctly. Make the
      interface simpler by combining sample_ms with polling_interval control
      provided by devfreq.
      
      The rounding of freq to multiples of bw_step is unnecessary since the
      devfreq device already does the rounding to the next valid level. Rounding
      of AB to multiples of bw_step is still necessary since it's a vote that's
      summed up and doesn't have a direct 1-to-1 mapping to frequencies.
      
      Also update default bw_step to 190 MB/s instead of 200 MB/s to account for
      the fact that MB/s to MHz conversion needs to take into account the
      difference in the meaning of M (2^20 vs 10^6) between MB and MHz. Similarly
      also update io_percent to 16.
      
      Change-Id: I5fea989c647955103de3813be8eb9ec612f131bc
      Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
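The 200 to 190 adjustment in the bw_step commit follows from the two meanings of "M". As a rough worked example (the helper name is hypothetical and this is only the arithmetic, not the governor's code), a rate of 200 counted in units of 10^6 per second comes out near 190 when counted in units of 2^20 per second:

```c
#include <stdint.h>

/* Convert a rate counted in units of 10^6 per second into units of
 * 2^20 per second, rounding down. Hypothetical helper for illustration. */
static uint64_t decimal_mega_to_binary_mega(uint64_t rate)
{
    return rate * 1000000u / (1u << 20);
}
```

With an input of 200 the exact quotient is about 190.7, which integer division truncates to 190.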
    • tcp: cubic: fix bug in bictcp_acked() · 20c1363d
      Eric Dumazet authored
      
      [ Upstream commit cd6b423afd3c08b27e1fed52db828ade0addbc6b ]
      
      While investigating a strange increase of retransmit rates
      on hosts ~24 days after boot, Van found hystart was disabled
      if ca->epoch_start was 0, as the following condition is true
      when the tcp_time_stamp high order bit is set.
      
      (s32)(tcp_time_stamp - ca->epoch_start) < HZ
      
      Quoting Van:
      
       At initialization & after every loss ca->epoch_start is set to zero so
       I believe that the above line will turn off hystart as soon as the 2^31
       bit is set in tcp_time_stamp & hystart will stay off for 24 days.
       I think we've observed that cubic's restart is too aggressive without
       hystart so this might account for the higher drop rate we observe.

      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
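The condition quoted in the bictcp_acked() commit can be reproduced in user space. The sketch below assumes HZ=1000 and uses hypothetical function names. The second function treats an epoch of 0 as "no epoch recorded", which is one plausible shape for the fix and is not taken from the patch itself. The signed cast relies on the usual two's-complement conversion.

```c
#include <stdbool.h>
#include <stdint.h>

#define HZ 1000 /* assumption: 1000 ticks per second */

/* The check as quoted in the commit message: true means "less than one
 * second since epoch_start". */
static bool within_first_second(uint32_t now, uint32_t epoch_start)
{
    return (int32_t)(now - epoch_start) < HZ;
}

/* Assumed shape of a guarded check: an epoch of 0 means "not set". */
static bool within_first_second_guarded(uint32_t now, uint32_t epoch_start)
{
    return epoch_start != 0 && (int32_t)(now - epoch_start) < HZ;
}
```

Once `now` reaches 2^31 with `epoch_start` still 0, the difference is negative as a signed 32-bit value, so the unguarded check stays true until the counter wraps. At HZ=1000 that is 2^31 ms, a little under 25 days.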
    • tcp: cubic: fix overflow error in bictcp_update() · 27d098d3
      Eric Dumazet authored

      [ Upstream commit 2ed0edf9090bf4afa2c6fc4f38575a85a80d4b20 ]

      commit 17a6e9f1 ("tcp_cubic: fix clock dependency") added an
      overflow error in bictcp_update() in the following code:
      
      /* change the unit from HZ to bictcp_HZ */
      t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) -
            ca->epoch_start) << BICTCP_HZ) / HZ;
      
      Because msecs_to_jiffies() returns unsigned long, the compiler does
      an implicit type promotion.
      
      We really want to constrain (tcp_time_stamp - ca->epoch_start)
      to a signed 32bit value, or else 't' has unexpected high values.
      
      This bug triggers an increase of retransmit rates ~24 days after
      boot [1], as the high order bit of tcp_time_stamp flips.
      
      [1] for hosts with HZ=1000
      
      Big thanks to Van Jacobson for spotting this problem.

      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
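A small sketch of how the promotion described in the bictcp_update() commit can produce an unexpectedly large value. It uses `uint64_t` to stand in for a 64-bit `unsigned long`. The function names and the sample timestamps are illustrative assumptions, not the kernel code.

```c
#include <stdint.h>

/* Mixed-width arithmetic: the 32-bit timestamps are widened to 64 bits
 * before the subtraction, so a 32-bit wrap is no longer cancelled out. */
static uint64_t elapsed_promoted(uint32_t now, uint32_t epoch_start,
                                 uint64_t delay)
{
    return now + delay - epoch_start;
}

/* Constrain the timestamp difference to a signed 32-bit value first,
 * then add the wider delay term. */
static uint64_t elapsed_constrained(uint32_t now, uint32_t epoch_start,
                                    uint64_t delay)
{
    int32_t diff = (int32_t)(now - epoch_start);

    return (uint64_t)((int64_t)diff + (int64_t)delay);
}
```

With `now = 0x10`, `epoch_start = 0xFFFFFFF0` and a delay of 2, the constrained version yields 34 ticks while the promoted version yields a value above 2^32.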
    • regmap: cache Fix regcache-rbtree sync · c68012cb
      Lars-Peter Clausen authored

      commit 8abac3ba51b5525354e9b2ec0eed1c9e95c905d9 upstream.
      
      The last register block that falls into the specified range is not handled
      correctly. The formula which calculates the number of registers which should be
      synced is inverted (and off by one). E.g. if all registers in that block should
      be synced only one is synced, and if only one should be synced all (but one) are
      synced. To calculate the number of registers that need to be synced we need to
      subtract the number of the first register in the block from the max register
      number and add one. This patch updates the code accordingly.
      
      The issue was introduced in commit ac8d91c8 ("regmap: Supply ranges to the sync
      operations").

      Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
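The corrected count described in the regcache-rbtree commit is "max register minus first register in the block, plus one". A minimal sketch, with a hypothetical function name and an added clamp to the block length for the case where the range extends past the block:

```c
/* Number of registers to sync in a block that starts at base_reg and
 * holds blklen registers, when syncing up to and including max_reg.
 * Assumes max_reg >= base_reg. Hypothetical helper for illustration. */
static unsigned int regs_to_sync(unsigned int base_reg, unsigned int max_reg,
                                 unsigned int blklen)
{
    unsigned int count = max_reg - base_reg + 1;

    return count < blklen ? count : blklen;
}
```

If only the first register of the block is in range the result is 1, and if the whole block is in range the result is the block length, which is the opposite of the inverted behaviour the message describes.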
    • net IPv6 : Fix broken IPv6 routing table after loopback down-up · 85878d73
      Balakumaran Kannan authored
      
      [ Upstream commit 25fb6ca4ed9cad72f14f61629b68dc03c0d9713f ]
      
      The IPv6 routing table becomes broken once we do ifdown, ifup of the loopback (lo)
      interface. After down-up, routes of other interfaces' IPv6 addresses through
      'lo' are lost.

      IPv6 addresses assigned to all interfaces are routed through 'lo' for internal
      communication. Once 'lo' is down, those routing entries are removed from the routing
      table. But those removed entries are not re-created properly when 'lo' is
      brought up. So IPv6 addresses of other interfaces become unreachable from the
      same machine. This also breaks communication with other machines because of
      NDISC packet processing failure.

      This patch fixes this issue by reading all interfaces' IPv6 addresses and adding
      them to the IPv6 routing table while bringing up 'lo'.
      
      ==Testing==
      Before applying the patch:
      $ route -A inet6
      Kernel IPv6 routing table
      Destination                    Next Hop                   Flag Met Ref Use If
      2000::20/128                   ::                         U    256 0     0 eth0
      fe80::/64                      ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      ::1/128                        ::                         Un   0   1     0 lo
      2000::20/128                   ::                         Un   0   1     0 lo
      fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
      ff00::/8                       ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      $ sudo ifdown lo
      $ sudo ifup lo
      $ route -A inet6
      Kernel IPv6 routing table
      Destination                    Next Hop                   Flag Met Ref Use If
      2000::20/128                   ::                         U    256 0     0 eth0
      fe80::/64                      ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      ::1/128                        ::                         Un   0   1     0 lo
      ff00::/8                       ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      $
      
      After applying the patch:
      $ route -A inet6
      Kernel IPv6 routing table
      Destination                    Next Hop                   Flag Met Ref Use If
      2000::20/128                   ::                         U    256 0     0 eth0
      fe80::/64                      ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      ::1/128                        ::                         Un   0   1     0 lo
      2000::20/128                   ::                         Un   0   1     0 lo
      fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
      ff00::/8                       ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      $ sudo ifdown lo
      $ sudo ifup lo
      $ route -A inet6
      Kernel IPv6 routing table
      Destination                    Next Hop                   Flag Met Ref Use If
      2000::20/128                   ::                         U    256 0     0 eth0
      fe80::/64                      ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      ::1/128                        ::                         Un   0   1     0 lo
      2000::20/128                   ::                         Un   0   1     0 lo
      fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
      ff00::/8                       ::                         U    256 0     0 eth0
      ::/0                           ::                         !n   -1  1     1 lo
      $

      Signed-off-by: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
      Signed-off-by: Maruthi Thotad <Maruthi.Thotad@ap.sony.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • TTY: fix atime/mtime regression · 683197c7
      Jiri Slaby authored
      
      commit 37b7f3c76595e23257f61bd80b223de8658617ee upstream.
      
      In commit b0de59b5733d ("TTY: do not update atime/mtime on read/write")
      we removed timestamps from tty inodes to fix a security issue and waited
      to see if anything would break.  Well, 'w', the utility to find out logged users
      and their inactivity time, broke.  It shows that users have been inactive since
      the time they logged in.

      To revert to the old behaviour while still preventing attackers from
      guessing the password length, we update the timestamps in one-minute
      intervals by this patch.

      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
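One way to picture the one-minute update interval from the atime/mtime regression fix is a timestamp that only advances to the start of the current minute. The helper below is a hypothetical user-space sketch under that assumption, not the tty code.

```c
#include <stdint.h>

/* Return the timestamp to store: the current time rounded down to a whole
 * minute, but never earlier than the value already stored. */
static int64_t coarse_stamp(int64_t stored, int64_t now_seconds)
{
    int64_t rounded = now_seconds - now_seconds % 60;

    return rounded > stored ? rounded : stored;
}
```

Because every keystroke within the same minute produces the same stored value, an observer polling the timestamp learns roughly when the terminal was last active, which is what 'w' needs, but not how many keystrokes occurred.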
    • TTY: do not update atime/mtime on read/write · dfdc56a3
      Jiri Slaby authored

      commit b0de59b5733d18b0d1974a060860a8b5c1b36a2e upstream.

      On http://vladz.devzero.fr/013_ptmx-timing.php, we can see how to find
      out the length of a password using timestamps of /dev/ptmx. It is
      documented in "Timing Analysis of Keystrokes and Timing Attacks on
      SSH". To avoid that problem, do not update time when reading
      from/writing to a TTY.
      
      I am afraid of regressions as this is a behavior we have had since 0.97
      and apps may expect the time to be current, e.g. for monitoring
      whether there was a change on the TTY. Now, there is no change. So
      this had better get a lot of testing before it goes upstream.
      
      References: CVE-2013-0160
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM: Do 15e0d9e3 (ARM: pm: let platforms select cpu_suspend support) properly · 9bd8ce11
      Russell King authored
      
      commit b6c7aabd923a17af993c5a5d5d7995f0b27c000a upstream.
      
      Let's do the changes properly and fix the same problem everywhere, not
      just for one case.

      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: convert number of blocks to clusters properly · 5f05c0a0
      Lukas Czerner authored
      
      commit 810da240f221d64bf90020f25941b05b378186fe upstream.
      
      We're using macro EXT4_B2C() to convert number of blocks to number of
      clusters for bigalloc file systems.  However, we should be using
      EXT4_NUM_B2C().

      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: CAI Qian <caiqian@redhat.com>
      Signed-off-by: Lingzhu Xiang <lxiang@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
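The difference between the two ext4 conversions comes down to rounding: converting a block number to its cluster rounds down, while converting a count of blocks to a count of clusters must round up. The functions below are stand-ins that take the cluster shift directly. They sketch the idea and are not the ext4 macros themselves.

```c
#include <stdint.h>

/* Cluster index containing a given block: round down. */
static uint64_t block_to_cluster(uint64_t blk, unsigned int cluster_bits)
{
    return blk >> cluster_bits;
}

/* Clusters needed to hold a given number of blocks: round up. */
static uint64_t num_blocks_to_clusters(uint64_t blks, unsigned int cluster_bits)
{
    uint64_t ratio = (uint64_t)1 << cluster_bits;

    return (blks + ratio - 1) >> cluster_bits;
}
```

With 16 blocks per cluster, 17 blocks need 2 clusters, whereas the round-down conversion would report 1 and undercount.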
    • android: drivers: Fix build broken by android debugfs fix. · abdc7b3d
      Riley Andrews authored

      Fix build broken by commit b0bd81a67ae5ced88b ("android: drivers: workaround
      debugfs race in binder").
      
      Change-Id: I10c5c0211144a4a1c270dc03f92cf6a1a829e8f8
    • ARM: footbridge: fix VGA initialisation · 94ed6132
      Russell King authored

      commit 43659222e7a0113912ed02f6b2231550b3e471ac upstream.
      
      It's no good setting vga_base after the VGA console has been
      initialised, because if we do that we get this:
      
      Unable to handle kernel paging request at virtual address 000b8000
      pgd = c0004000
      [000b8000] *pgd=07ffc831, *pte=00000000, *ppte=00000000
      0Internal error: Oops: 5017 [#1] ARM
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.0+ #49
      task: c03e2974 ti: c03d8000 task.ti: c03d8000
      PC is at vgacon_startup+0x258/0x39c
      LR is at request_resource+0x10/0x1c
      pc : [<c01725d0>]    lr : [<c0022b50>]    psr: 60000053
      sp : c03d9f68  ip : 000b8000  fp : c03d9f8c
      r10: 000055aa  r9 : 4401a103  r8 : ffffaa55
      r7 : c03e357c  r6 : c051b460  r5 : 000000ff  r4 : 000c0000
      r3 : 000b8000  r2 : c03e0514  r1 : 00000000  r0 : c0304971
      Flags: nZCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  Segment kernel
      
      which is an access to the 0xb8000 without the PCI offset required to
      make it work.
      
      Fixes: cc22b4c1 ("ARM: set vga memory base at run-time")
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Cc: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM: 7742/1: topology: export cpu_topology · 71ada24d
      Arnd Bergmann authored
      
      commit 92bdd3f5eba299b33c2f4407977d6fa2e2a6a0da upstream.
      
      The cpu_topology symbol is required by any driver using the topology
      interfaces, which leads to a couple of build errors:
      
      ERROR: "cpu_topology" [drivers/net/ethernet/sfc/sfc.ko] undefined!
      ERROR: "cpu_topology" [drivers/cpufreq/arm_big_little.ko] undefined!
      ERROR: "cpu_topology" [drivers/block/mtip32xx/mtip32xx.ko] undefined!
      
      The obvious solution is to export this symbol.

      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Cc: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM: fix "bad mode in ... handler" message for undefined instructions · b2a6a5a9
      Russell King authored
      
      commit 29c350bf28da333e41e30497b649fe335712a2ab upstream.
      
      The array was missing the final entry for the undefined instruction
      exception handler; this commit adds it.

      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mmc: core: Fix clock frequency transitions during invalid states · d68e6610
      Sujit Reddy Thumma authored
      
      eMMC and SD card specifications restrict the usage of a class of
      commands while commands in another class are in progress. For example,
      during erase operations the SD/eMMC spec allows only CMD35, CMD36, and
      CMD38. If clock scaling is enabled and decides to scale up the clocks,
      CMD19/21 tuning commands may be sent in between erase commands,
      which is illegal as per the specification.

      Fix such illegal transactions to the card and also make clock scaling
      statistics account only for read/write commands instead of
      time-consuming commands, like CMD38 erase, where transactions are
      independent of bus frequency.
      
      Change-Id: Iffba175787837e7f95bde8970f19d0f0f9d7d67d
      Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
    • tmpfs: fix use-after-free of mempolicy object · c87fe6ad
      Greg Thelen authored
      
      commit 5f00110f7273f9ff04ac69a5f85bb535a4fd0987 upstream.
      
      The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
      option is not specified in the remount request.  A new policy can be
      specified if mpol=M is given.
      
      Before this patch remounting an mpol bound tmpfs without specifying
      mpol= mount option in the remount request would set the filesystem's
      mempolicy object to a freed mempolicy object.
      
      To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
          # mkdir /tmp/x
      
          # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
      
          # mount -o remount,size=200M nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
              # note ? garbage in mpol=... output above
      
          # dd if=/dev/zero of=/tmp/x/f count=1
              # panic here
      
      Panic:
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          [...]
          Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
          Call Trace:
            mpol_shared_policy_init+0xa5/0x160
            shmem_get_inode+0x209/0x270
            shmem_mknod+0x3e/0xf0
            shmem_create+0x18/0x20
            vfs_create+0xb5/0x130
            do_last+0x9a1/0xea0
            path_openat+0xb3/0x4d0
            do_filp_open+0x42/0xa0
            do_sys_open+0xfe/0x1e0
            compat_sys_open+0x1b/0x20
            cstar_dispatch+0x7/0x1f
      
      Non-debug kernels will not crash immediately because referencing the
      dangling mpol will not cause a fault.  Instead the filesystem will
      reference a freed mempolicy object, which will cause unpredictable
      behavior.
      
      The problem boils down to a dropped mpol reference below if
      shmem_parse_options() does not allocate a new mpol:
      
          config = *sbinfo
          shmem_parse_options(data, &config, true)
          mpol_put(sbinfo->mpol)
          sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */
      
      This patch avoids the crash by not releasing the mempolicy if
      shmem_parse_options() doesn't create a new mpol.
      
      How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
      not look back further.

      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
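The remount rule described in the tmpfs use-after-free commit (only release the old policy when the request actually supplied a new one) can be modelled with a toy reference count. The struct and function names below are hypothetical stand-ins, not the shmem code.

```c
#include <stddef.h>

struct policy { int refcount; };        /* stand-in for a mempolicy object */
struct sb_info { struct policy *mpol; }; /* stand-in for the superblock info */

/* Drop one reference. A real implementation would free at zero. */
static void policy_put(struct policy *p)
{
    if (p)
        p->refcount--;
}

/* new_mpol is NULL when the remount request carried no mpol= option.
 * In that case the existing policy and its reference are preserved. */
static void remount_apply(struct sb_info *sb, struct policy *new_mpol)
{
    if (new_mpol) {
        policy_put(sb->mpol);
        sb->mpol = new_mpol;
    }
}
```

In this model a remount without mpol= leaves both the pointer and the reference count untouched, so the filesystem never ends up holding a policy whose last reference was already dropped.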
    • mm: fix pageblock bitmap allocation · da587821
      Linus Torvalds authored
      
      commit 7c45512df987c5619db041b5c9b80d281e26d3db upstream.
      
      Commit c060f943d092 ("mm: use aligned zone start for pfn_to_bitidx
      calculation") fixed our calculation of the index into the pageblock
      bitmap when a !SPARSEMEM zone was not aligned to pageblock_nr_pages.

      However, the _allocation_ of that bitmap had never taken this alignment
      requirement into account, so depending on the exact size and alignment of
      the zone, the use of that index could then access past the allocation,
      resulting in some very subtle memory corruption.
      
      This was reported (and bisected) by Ingo Molnar: one of his random
      config builds would hang with certain very specific kernel command line
      options.
      
      In the meantime, commit c060f943d092 has been marked for stable, so this
      fix needs to be back-ported to the stable kernels that backported the
      commit to use the right alignment.

      Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
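To see why the alignment matters for the pageblock bitmap allocation, count how many pageblocks a zone touches. The sketch below uses hypothetical function names and assumes the pageblock size is a power of two. It compares a count based on the zone size alone with one that also accounts for the offset of the zone start within its first pageblock.

```c
/* Pageblocks needed if the zone is assumed to start on a pageblock boundary. */
static unsigned long pageblocks_naive(unsigned long zone_pages,
                                      unsigned long pageblock_pages)
{
    return (zone_pages + pageblock_pages - 1) / pageblock_pages;
}

/* Pageblocks actually touched by a zone starting at zone_start_pfn.
 * pageblock_pages must be a power of two. */
static unsigned long pageblocks_aligned(unsigned long zone_start_pfn,
                                        unsigned long zone_pages,
                                        unsigned long pageblock_pages)
{
    zone_pages += zone_start_pfn & (pageblock_pages - 1);
    return (zone_pages + pageblock_pages - 1) / pageblock_pages;
}
```

A zone of 1024 pages that starts 512 pages into a 1024-page pageblock spans two pageblocks, so a bitmap sized for one would be indexed past its end.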
    • tmpfs: fix shared mempolicy leak · ffa815dc
      Mel Gorman authored
      
      commit 18a2f371f5edf41810f6469cb9be39931ef9deb9 upstream.
      
      This fixes a regression in 3.7-rc, which has since gone into stable.
      
      Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
      imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
      refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
      on expecting alloc_page_vma() to drop the refcount it had acquired.
      This deserves a rework: but for now fix the leak in shmem_alloc_page().
      
      Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
      the same refcounting there as in shmem_alloc_page(), delete its onstack
      mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
      those were invented to let swapin_readahead() make an unknown number of
      calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
      alloc_pages_vma() has kept refcount in balance, so now no problem.

      Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tmpfs mempolicy: fix /proc/mounts corrupting memory · c4a3839f
      Hugh Dickins authored

      commit f2a07f40dbc603c15f8b06e6ec7f768af67b424f upstream.
      
      Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
      mempolicy testing.  Very nasty.  Reading /proc/mounts, /proc/pid/mounts
      or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
      in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
      pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
      worse.  "mpol=prefer" and "mpol=prefer:Node" are equally toxic.
      
      Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
      when commit e17f74af "mempolicy: don't call mpol_set_nodemask() when
      no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
      which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
      With slab poisoning, you can then rely on mpol_to_str() to set the bit
      for node 0x6b6b, probably in the next page above the caller's stack.
      
      mpol_parse_str() is only called from shmem_parse_options(): no_context
      is always true, so call it unused for now, and remove !no_context code.
      Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
      expect.  Then mpol_to_str() can ignore its no_context argument also,
      the mpol being appropriately initialized whether contextualized or not.
      Rename its no_context unused too, and let subsequent patch remove them
      (that's not needed for stable backporting, which would involve rejects).
      
      I don't understand why MPOL_LOCAL is described as a pseudo-policy:
      it's a reasonable policy which suffers from a confusing implementation
      in terms of MPOL_PREFERRED with MPOL_F_LOCAL.  I believe this would be
      much more robust if MPOL_LOCAL were recognized in switch statements
      throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
      empty) nodes mask like everyone else, instead of its preferred_node
      variant (I presume an optimization from the days before MPOL_LOCAL).
      But that would take me too long to get right and fully tested.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • memcg: oom: fix totalpages calculation for memory.swappiness==0 · d632f7e2
      Michal Hocko authored
      
      commit 9a5a8f19b43430752067ecaee62fc59e11e88fa6 upstream.
      
      oom_badness() takes a totalpages argument which says how many pages are
      available and it uses it as a base for the score calculation.  The value
      is calculated by mem_cgroup_get_limit which considers both limit and
      total_swap_pages (resp.  memsw portion of it).
      
      This is usually correct, but since fe35004fbf9e ("mm: avoid swapping out
      with swappiness==0") we do not swap when swappiness is 0, which means
      that we cannot really use up all the totalpages pages.  This in turn
      confuses the oom score calculation if the memcg limit is much smaller than
      the available swap, because the used memory (capped by the limit) is
      negligible compared to totalpages, so the resulting score is too small
      if adj!=0 (typically a task with CAP_SYS_ADMIN or non-zero oom_score_adj).
      A wrong process might be selected as a result.
      
      The problem can be worked around by checking mem_cgroup_swappiness==0
      and not considering swap at all in such a case.

      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
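A sketch of the workaround from the memcg oom commit: swap only contributes to the total when swappiness is non-zero. The function and parameter names are hypothetical, and all quantities are plain byte counts. This illustrates the rule in the message and is not the memcg implementation.

```c
#include <stdint.h>

/* Total memory the OOM score should be measured against.
 * mem_limit:   memory limit of the group
 * total_swap:  swap available system-wide
 * memsw_limit: combined memory+swap limit of the group
 * swappiness:  0 means the group never swaps */
static uint64_t oom_total_bytes(uint64_t mem_limit, uint64_t total_swap,
                                uint64_t memsw_limit, unsigned int swappiness)
{
    uint64_t limit = mem_limit;

    if (swappiness) {
        limit += total_swap;
        if (memsw_limit < limit)
            limit = memsw_limit; /* a finite memsw limit caps usable swap */
    }
    return limit;
}
```

With swappiness 0 the total stays at the memory limit, so a task's usage is compared against what the group can really consume rather than against swap it will never touch.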
    • mempolicy: fix a race in shared_policy_replace() · b0dc0842
      Mel Gorman authored
      
      commit b22d127a39ddd10d93deee3d96e643657ad53a49 upstream.
      
      shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
      be dereferenced if sp->lock is not held and 2) another thread can modify
      sp_node between spin_unlock for allocating a new sp node and next
      spin_lock.  The bug was introduced before 2.6.12-rc2.
      
      Kosaki's original patch for this problem was to allocate an sp node and
      policy within shared_policy_replace and initialise it when the lock is
      reacquired.  I was not keen on this approach because it partially
      duplicates sp_alloc().  As the paths where sp->lock is taken are not that
      performance critical, this patch converts sp->lock to sp->mutex so it can
      sleep when calling sp_alloc().
      
      [kosaki.motohiro@jp.fujitsu.com: Original patch]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: Add notifier framework for showing memory · 42844307
      Laura Abbott authored
      
      There are many drivers in the kernel which can hold on
      to lots of memory. It can be useful to dump out all those
      drivers at key points in the kernel. Introduce a notifier
      framework for dumping this information. When the notifiers
      are called, drivers can dump out the state of any memory
      they may be using.
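
A minimal userspace sketch of the idea, not the code from this patch: drivers register a callback on a chain, and a single call walks the chain so each driver can dump its memory state. The names `mem_dump_notifier`, `mem_dump_register`, `mem_dump_call_all` and `example_driver_show` are hypothetical and chosen only for illustration; the kernel implementation would build on the kernel's own notifier chains instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of one registered "show memory" callback. */
struct mem_dump_notifier {
    void (*show)(void *data);          /* dumps the driver's memory state */
    void *data;                        /* driver-private argument */
    struct mem_dump_notifier *next;    /* next entry on the chain */
};

/* Head of the (hypothetical) chain; empty until a driver registers. */
static struct mem_dump_notifier *mem_dump_chain;

/* Add a driver callback at the head of the chain. */
static void mem_dump_register(struct mem_dump_notifier *n)
{
    assert(n != NULL && n->show != NULL);
    n->next = mem_dump_chain;
    mem_dump_chain = n;
}

/* Call every registered callback; returns how many were called. */
static int mem_dump_call_all(void)
{
    int called = 0;

    for (struct mem_dump_notifier *n = mem_dump_chain; n; n = n->next) {
        n->show(n->data);
        called++;
    }
    return called;
}

/* Example driver callback: report how many bytes the driver holds. */
static void example_driver_show(void *data)
{
    printf("example driver holds %zu bytes\n", *(size_t *)data);
}
```

A driver would fill in a `struct mem_dump_notifier` with `example_driver_show` and a pointer to its byte counter, register it once, and rely on `mem_dump_call_all()` being invoked at the points where a memory dump is wanted.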
      
      Change-Id: I514ef1d01510a50970a661c8e9bedc8b78683eab
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      42844307
    • mm: vmpressure: allow in-kernel clients to subscribe for events · 2570b7de
      Vinayak Menon authored
      
      Currently, vmpressure is tied to memcg and its events are
      available only to userspace clients. This patch removes
      the dependency on CONFIG_MEMCG and adds a mechanism for
      in-kernel clients to subscribe for vmpressure events (in
      fact raw vmpressure values are delivered instead of vmpressure
      levels, to provide clients more flexibility to take actions
      on custom pressure levels which are not currently defined
      by vmpressure module).
      
      Change-Id: I1500c098cde11010e463d67955e8a03feb193a67
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      2570b7de
    • mm, oom: make dump_tasks public · 1e9e190a
      Liam Mark authored
      
      Allow other functions to dump the list of tasks.
      This is useful when debugging memory leaks.
      
      Bug: 17871993
      Change-Id: I0d9e812d242cbd9e152d561be9a16c00bad3c032
      Signed-off-by: Liam Mark <lmark@codeaurora.org>
      Signed-off-by: Naveen Ramaraj <nramaraj@codeaurora.org>
      1e9e190a
    • memcg: add memory.pressure_level events · d12c78e5
      Anton Vorontsov authored

      With this patch userland applications that want to maintain the
      interactivity/memory allocation cost can use the pressure level
      notifications.  The levels are defined like this:
      
      The "low" level means that the system is reclaiming memory for new
      allocations.  Monitoring this reclaiming activity might be useful for
      maintaining cache level.  Upon notification, the program (typically
      "Activity Manager") might analyze vmstat and act in advance (i.e.
      prematurely shut down unimportant services).
      
      The "medium" level means that the system is experiencing medium memory
      pressure; the system might be swapping, paging out active file
      caches, etc.  Upon this event applications may decide to further analyze
      vmstat/zoneinfo/memcg or internal memory usage statistics and free any
      resources that can be easily reconstructed or re-read from a disk.
      
      The "critical" level means that the system is actively thrashing; it is
      about to run out of memory (OOM), or the in-kernel OOM killer is even on
      its way to trigger.  Applications should do whatever they can to help the
      system.  It might be too late to consult with vmstat or any other
      statistics, so it's advisable to take an immediate action.
      
      The events are propagated upward until the event is handled, i.e.  the
      events are not pass-through.  Here is what this means: for example you
      have three cgroups: A->B->C.  Now you set up an event listener on
      cgroups A, B and C, and suppose group C experiences some pressure.  In
      this situation, only group C will receive the notification, i.e.  groups
      A and B will not receive it.  This is done to avoid excessive
      "broadcasting" of messages, which disturbs the system and which is
      especially bad if we are low on memory or thrashing.  So, organize the
      cgroups wisely, or propagate the events manually (or, ask us to
      implement the pass-through events, explaining why you would need them).
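
The propagation rule above can be sketched in plain C. This is an illustrative model under stated assumptions, not the memcg implementation: `struct group`, `has_listener` and `deliver_pressure` are hypothetical names, and a group "handles" an event simply by having a listener registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cgroup node; not the kernel's data structure. */
struct group {
    const char *name;
    struct group *parent;    /* NULL for the root of the hierarchy */
    bool has_listener;       /* an event listener is registered here */
    int events_received;     /* notifications delivered to this group */
};

/*
 * Deliver a pressure event that originated in `origin`. Walk towards the
 * root and stop at the first group that has a listener, so the groups
 * above it never see the event. Returns the group that handled the event,
 * or NULL if no group on the path has a listener.
 */
static struct group *deliver_pressure(struct group *origin)
{
    assert(origin != NULL);

    for (struct group *g = origin; g != NULL; g = g->parent) {
        if (g->has_listener) {
            g->events_received++;
            return g;
        }
    }
    return NULL;
}
```

With listeners on A, B and C in the A->B->C example, an event raised in C stops at C. If C had no listener, the same walk would hand the event to B and still leave A untouched.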
      
      Performance-wise, the memory pressure notifications feature itself is
      lightweight and does not require much bookkeeping, in contrast to the
      rest of memcg features.  Unfortunately, as of current memcg
      implementation, page accounting is an inseparable part and cannot be
      turned off.  The good news is that there are some efforts[1] to improve
      the situation; plus, implementing the same, fully API-compatible[2]
      interface for CONFIG_CGROUP_MEM_RES_CTLR=n case (e.g.  embedded) is also
      a viable option, so it will not require any changes on the userland
      side.
      
      [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
      [2] http://lkml.org/lkml/2013/2/21/454

      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Change-Id: I4e703d3688c74466e02cf0f2b866e85043fe799d
      d12c78e5
    • slab: fix the DEADLOCK issue on l3 alien lock · 788a9073
      Michael Wang authored

      commit 947ca1856a7e60aa6d20536785e6a42dff25aa6e upstream.
      
      A DEADLOCK will be reported when running a kernel with NUMA and LOCKDEP enabled;
      the sequence behind this false report is:
      
      	   kmem_cache_free()	//free obj in cachep
      	-> cache_free_alien()	//acquire cachep's l3 alien lock
      	-> __drain_alien_cache()
      	-> free_block()
      	-> slab_destroy()
      	-> kmem_cache_free()	//free slab in cachep->slabp_cache
      	-> cache_free_alien()	//acquire cachep->slabp_cache's l3 alien lock
      
      Since the cachep and cachep->slabp_cache's l3 alien are in the same lock class,
      a false report is generated.
      
      This should not happen since we already have init_lock_keys() which will
      reassign the lock class for both l3 list and l3 alien.
      
      However, init_lock_keys() was invoked at the wrong point, which is before we
      invoke enable_cpucache() on each cache.
      
      Until slab_state is set to FULL, we do not invoke enable_cpucache() on caches
      to build their l3 alien while creating them. So although init_lock_keys() was
      invoked, the l3 alien lock class did not change, because the l3 alien does not
      exist until enable_cpucache() is invoked later.
      
      This patch invokes init_lock_keys() after enable_cpucache() is done
      instead of before, to avoid the false DEADLOCK report.
      
      Michael traced the problem back to a commit in release 3.0.0:
      
      commit 30765b92
      Author: Peter Zijlstra <peterz@infradead.org>
      Date:   Thu Jul 28 23:22:56 2011 +0200
      
          slab, lockdep: Annotate the locks before using them
      
          Fernando found we hit the regular OFF_SLAB 'recursion' before we
          annotate the locks, cure this.
      
          The relevant portion of the stack-trace:
      
          > [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
          > [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
          > [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
          > [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
          > [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
          > [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
          > [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
          > [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
          > [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
          > [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf

          Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
          Acked-by: Pekka Enberg <penberg@kernel.org>
          Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
          Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
          Signed-off-by: Ingo Molnar <mingo@elte.hu>
      
      The commit moved init_lock_keys() before we build up the alien, so we
      failed to reclass it.

      Acked-by: Christoph Lameter <cl@linux.com>
      Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      788a9073
    • arm: arch_timer: set memory mapped timer interrupt as IRQF_TIMER · 8808c5cc
      Mahesh Sivasubramanian authored
      
      The memory mapped timer is used as a broadcast timer to wake the core for
      timer interrupts when the arch timer might not be functional. When interrupt
      is not marked as IRQF_NO_SUSPEND, the interrupt gets disabled during the
      suspend_device_irqs() callback in the suspend path. If a core were to enter an
      idle low power mode which relies on broadcast timer to process the interrupt,
      the core is never woken up for timer interrupts.
      
      Mark the interrupt with IRQF_TIMER, which marks this interrupt as a timer
      interrupt and also marks it as IRQF_NO_SUSPEND.
      
      CRs-fixed: 636712
      Change-Id: I0484e92a9d05f66a0c5b3c00c584a3dd3fe6ae85
      Signed-off-by: Mahesh Sivasubramanian <msivasub@codeaurora.org>
      8808c5cc
    • ARM: Allow panic on division by zero in the kernel · aaa03166
      Syed Rameez Mustafa authored
      
      Division by zero errors in the kernel currently trigger warnings.
      Allow panic on these errors so that we can catch the problem closer
      to its source.
      
      Change-Id: Id5fed71b74cd37874ae857a8105455d7561c782d
      Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
      aaa03166
    • memory hotplug: fix invalid memory access caused by stale kswapd pointer · e7554260
      Jiang Liu authored
      
      commit d8adde17e5f858427504725218c56aef90e90fc7 upstream.
      
      kswapd_stop() is called to destroy the kswapd work thread when all memory
      of a NUMA node has been offlined.  But kswapd_stop() only terminates the
      work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
      pointer will prevent kswapd_run() from creating a new work thread when
      adding memory to the memory-less NUMA node again.  Eventually the stale
      pointer may cause invalid memory access.
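
The stale-pointer pattern can be illustrated with a small userspace model. This is a sketch only, assuming a per-node pointer that doubles as the "worker exists" flag; `struct node_model`, `worker_run` and `worker_stop` are hypothetical stand-ins and do not reproduce kswapd_run() or kswapd_stop().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-node worker thread. */
struct worker {
    int running;
};

/* Hypothetical stand-in for per-node data holding the worker pointer. */
struct node_model {
    struct worker *worker;   /* NULL means "no worker exists" */
};

/*
 * Start a worker for the node. Returns 1 if a new worker was created,
 * 0 if one already exists, and -1 if allocation failed.
 */
static int worker_run(struct node_model *n)
{
    assert(n != NULL);

    if (n->worker)
        return 0;    /* a stale pointer would wrongly take this path */

    n->worker = malloc(sizeof(*n->worker));
    if (!n->worker)
        return -1;
    n->worker->running = 1;
    return 1;
}

/* Stop the worker and reset the pointer so a later run can recreate it. */
static void worker_stop(struct node_model *n)
{
    assert(n != NULL);

    if (!n->worker)
        return;
    free(n->worker);
    n->worker = NULL;    /* the reset that the fix adds */
}
```

Without the final assignment in `worker_stop`, a second `worker_run` would see a non-NULL pointer, skip creation, and leave the node pointing at freed memory.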
      
      An example stack dump is shown below. It was reproduced with 2.6.32, but the
      latest kernel has the same issue.
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
        PGD 0
        Oops: 0000 [#1] SMP
        last sysfs file: /sys/devices/system/memory/memory391/state
        CPU 11
        Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
        RIP: 0010:exit_creds+0x12/0x78
        RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
        RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
        RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
        R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
        R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
        FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
        Stack:
         ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
         ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
         0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
        Call Trace:
          __put_task_struct+0x5d/0x97
          kthread_stop+0x50/0x58
          offline_pages+0x324/0x3da
          memory_block_change_state+0x179/0x1db
          store_mem_state+0x9e/0xbb
          sysfs_write_file+0xd0/0x107
          vfs_write+0xad/0x169
          sys_write+0x45/0x6e
          system_call_fastpath+0x16/0x1b
        Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
        RIP  exit_creds+0x12/0x78
         RSP <ffff8806044f1d78>
        CR2: 0000000000000000
      
      [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e7554260
    • mm/memblock: fix memory leak on extending regions · defc6575
      Gavin Shan authored
      
      commit 181eb39425f2b9275afcb015eaa547d11f71a02f upstream.
      
      The overall memblock has been organized into the memory regions and
      reserved regions.  Initially, the memory regions and reserved regions are
      stored in the predetermined arrays of "struct memblock_region".  It's
      possible for the arrays to be enlarged when we have newly added regions,
      but no free space left there.  The policy here is to create a double-sized
      array either by slab allocator or memblock allocator.  Unfortunately, we
      didn't free the old array, which might have been allocated through the slab
      allocator before.  That would cause a memory leak.
      
      The patch introduces 2 variables to trace where (slab or memblock) the
      memory and reserved regions come from.  The memory for the memory or
      reserved regions will be deallocated by kfree() if it was allocated by
      the slab allocator, which fixes the memory leak.
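
A minimal sketch of the bookkeeping described above, written as ordinary userspace C rather than memblock code. It assumes a single flag per array that records whether the current storage came from the heap (standing in for the slab allocator) or from a static boot-time array; `struct region_array` and `double_array` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a growable region array. */
struct region_array {
    int *regions;      /* current storage */
    size_t max;        /* capacity in entries */
    bool from_heap;    /* false while the static boot array is in use */
};

/*
 * Double the capacity of the array. The old storage is freed only when
 * it came from the heap; the static boot array must never be freed.
 * Returns 0 on success and -1 if allocation failed.
 */
static int double_array(struct region_array *a)
{
    assert(a != NULL && a->regions != NULL && a->max > 0);

    size_t new_max = a->max * 2;
    int *bigger = malloc(new_max * sizeof(*bigger));

    if (!bigger)
        return -1;

    memcpy(bigger, a->regions, a->max * sizeof(*bigger));
    if (a->from_heap)
        free(a->regions);    /* without this, every doubling leaks */

    a->regions = bigger;
    a->max = new_max;
    a->from_heap = true;
    return 0;
}
```

The first doubling moves off the static array and frees nothing; every later doubling releases the previous heap allocation.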

      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      defc6575
    • mm/memblock: fix overlapping allocation when doubling reserved array · ef674d27
      Greg Pearson authored
      
      commit 48c3b583bbddad2220ca4c22319ca5d1f78b2090 upstream.
      
      __alloc_memory_core_early() asks memblock for a range of memory, then tries
      to reserve it.  If the reserved region array lacks space for the new
      range, memblock_double_array() is called to allocate more space for the
      array.  If memblock is used to allocate memory for the new array it can
      end up using a range that overlaps with the range originally allocated in
      __alloc_memory_core_early(), leading to possible data corruption.
      
      With this patch memblock_double_array() now calls memblock_find_in_range()
      with a narrowed candidate range (in cases where the reserved.regions array
      is being doubled) so any memory allocated will not overlap with the
      original range that was being reserved.  The range is narrowed by passing
      in the starting address and size of the previously allocated range.  Then
      the range above the ending address is searched and if a candidate is not
      found, the range below the starting address is searched.
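
The search order described above (above the excluded range first, then below it) can be sketched as follows. This is a simplified model that assumes one contiguous free window and a single excluded range; `find_avoiding` is a hypothetical helper and is not memblock_find_in_range(). It returns 0 on failure, so the window is assumed to start above address 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: find `size` bytes inside the free window [lo, hi)
 * without touching the excluded range [ex_start, ex_start + ex_size).
 * The space above the excluded range is tried first, then the space
 * below it. Returns the start address, or 0 if nothing fits.
 */
static uint64_t find_avoiding(uint64_t lo, uint64_t hi, uint64_t size,
                              uint64_t ex_start, uint64_t ex_size)
{
    assert(lo > 0 && lo <= hi);    /* 0 is reserved as the failure value */

    uint64_t ex_end = ex_start + ex_size;
    uint64_t above = ex_end > lo ? ex_end : lo;

    /* Candidate above the excluded range: [above, hi). */
    if (above < hi && hi - above >= size)
        return above;

    /* Candidate below the excluded range: [lo, min(ex_start, hi)). */
    uint64_t below_end = ex_start < hi ? ex_start : hi;
    if (below_end > lo && below_end - lo >= size)
        return lo;

    return 0;
}
```

Because both candidates lie strictly outside the excluded range, whatever this returns cannot overlap the range that was originally being reserved.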

      Signed-off-by: Greg Pearson <greg.pearson@hp.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ef674d27
    • mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · 34967266
      Mel Gorman authored

      commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream.
      
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably introduced
      in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
      look like this:
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately, it also appears to be insufficient. Consider the following
      situation:
      
      Process A					Process B
      ---------					---------
      hugetlb_fault					shmdt
        						LockWrite(mmap_sem)
          						  do_munmap
      						    unmap_region
      						      unmap_vmas
      						        unmap_single_vma
      						          unmap_hugepage_range
            						            Lock(i_mmap_mutex)
      							    Lock(mm->page_table_lock)
      							    huge_pmd_unshare/unmap tables <--- (1)
      							    Unlock(mm->page_table_lock)
            						            Unlock(i_mmap_mutex)
        huge_pte_alloc				      ...
          Lock(i_mmap_mutex)				      ...
          vma_prio_walk, find svma, spte		      ...
          Lock(mm->page_table_lock)			      ...
          share spte					      ...
          Unlock(mm->page_table_lock)			      ...
          Unlock(i_mmap_mutex)			      ...
        hugetlb_no_page									  <--- (2)
      						      free_pgtables
      						        unlink_file_vma
      							hugetlb_free_pgd_range
      						    remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vulnerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
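      ==== CUT HERE ====
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <sys/ipc.h>
      #include <sys/shm.h>
      #include <sys/time.h>
      #include <sys/wait.h>
      #include <unistd.h>
      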
      static size_t huge_page_size = (2UL << 20);
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could iterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork:");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      34967266
    • mm: mmu_notifier: fix freed page still mapped in secondary MMU · 6d94102f
      Xiao Guangrong authored
      
      commit 3ad3d901bbcfb15a5e4690e55350db0899095a68 upstream.
      
      mmu_notifier_release() is called when the process is exiting.  It will
      delete all the mmu notifiers.  But at this time the page belonging to the
      process is still present in page tables and is present on the LRU list, so
      this race will happen:
      
            CPU 0                 CPU 1
      mmu_notifier_release:    try_to_unmap:
         hlist_del_init_rcu(&mn->hlist);
                                  ptep_clear_flush_notify:
                                        mmu notifier not found
                                  free page  !!!!!!
                                  /*
                                   * At this point, the page has been
                                   * freed, but it is still mapped in
                                   * the secondary MMU.
                                   */
      
        mn->ops->release(mn, mm);
      
      Then the box is not stable and sometimes we can get this bug:
      
      [  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
      [  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
      [  738.075936] page flags: 0x20000000000014(referenced|dirty)
      
      The same issue is present in mmu_notifier_unregister().
      
      We can call ->release before deleting the notifier to ensure the page has
      been unmapped from the secondary MMU before it is freed.

      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d94102f
    • Hugh Dickins's avatar
      mm: fix crashes from mbind() merging vmas · f14412af
      Hugh Dickins authored
      
      commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.
      
      In v2.6.34 commit 9d8cebd4 ("mm: fix mbind vma merge problem")
      introduced vma merging to mbind(), but it should have also changed the
      convention of passing start vma from queue_pages_range() (formerly
      check_range()) to new_vma_page(): vma merging may have already freed
      that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
      worse crashes.
      
      Fixes: 9d8cebd4 ("mm: fix mbind vma merge problem")
      Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f14412af
    • Xiaozhe Shi's avatar
      power: qpnp-bms: always limit soc to [0, 100] · 124e06c1
      Xiaozhe Shi authored
      
      Currently, soc is only limited to [0, 100] when adjust_soc runs.
      However, there are some cases where the main algorithm of adjust_soc is
      skipped, due to charging, SOC being too high or in the flat region of
      the PC/OCV curve.
      
      This can cause issues where SOC is calculated to be over 100 or under
      0, which is undesirable. Fix this by moving the bound_soc call to the
      main calculate_soc function so that it is never skipped.
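
A minimal sketch of the restructuring described above, assuming a simplified integer SOC. The `_sketch` functions and the `skip_adjustment` flag are hypothetical stand-ins, and the adjustment itself is a placeholder; only the position of the clamp reflects the commit message.

```c
/* Clamp a state-of-charge value to the valid percentage range [0, 100]. */
static int bound_soc(int soc)
{
    if (soc < 0)
        return 0;
    if (soc > 100)
        return 100;
    return soc;
}

/*
 * Hypothetical stand-in for adjust_soc: when skip_adjustment is set
 * (e.g. while charging), the main algorithm is bypassed and the raw value
 * is returned untouched, so nothing here guarantees the range.
 */
static int adjust_soc_sketch(int soc, int skip_adjustment)
{
    if (skip_adjustment)
        return soc;
    return soc + 1; /* placeholder for the real adjustment */
}

/*
 * Hypothetical stand-in for calculate_soc: the clamp lives here, after the
 * adjustment step, so it runs on every path, including the skipped ones.
 */
static int calculate_soc_sketch(int raw_soc, int skip_adjustment)
{
    int soc = adjust_soc_sketch(raw_soc, skip_adjustment);

    return bound_soc(soc);
}
```

Placing the clamp in the caller means no early return inside the adjustment step can leak an out-of-range value.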
      
      CRs-Fixed: 697713
      Change-Id: I641f513d182c62731a4fc115f29c0e38e5ec4c14
      Signed-off-by: Xiaozhe Shi <xiaozhes@codeaurora.org>
      124e06c1
    • Sujit Reddy Thumma's avatar
      mmc: sdhci-msm: Fix clock gating while voltage switch is in progress · 35afe159
      Sujit Reddy Thumma authored
      
      The CLK_PWRSAVE bit in the vendor-specific register automatically gates
      the output clock to the card when there are no data/cmd operations.
      
      According to the SD 3.0 voltage switch sequence, the host should provide
      a clock to the card for at least one millisecond before the DAT[3:0] lines
      are pulled high by the card. If the power save bit is enabled in this
      case, it might auto-gate the clock before the card completes the voltage
      switch sequence.
      
      Fix this by disabling power save operation when the clocks are turned
      off and enabling it only when the clock rate is >400KHz, i.e., at the
      end of initialization.
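
The policy described above can be sketched as a pure function over a register value. This is an illustration, not the driver's code: the bit position, the macro names, and `sketch_update_pwrsave` are assumptions made for the example, and only the 400 kHz threshold comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position for illustration; not taken from the driver. */
#define SKETCH_CLK_PWRSAVE   (1u << 1)
/* Clock rates at or below this are treated as card initialization. */
#define SKETCH_INIT_MAX_HZ   400000u

/* Power save is allowed only once the clock is above the init rate. */
static bool sketch_pwrsave_allowed(unsigned int clock_hz)
{
    return clock_hz > SKETCH_INIT_MAX_HZ;
}

/*
 * Return the new register value for a requested clock rate.
 * A rate of 0 means the clocks are being turned off, which also
 * clears the power save bit. All other bits are preserved.
 */
static uint32_t sketch_update_pwrsave(uint32_t reg, unsigned int clock_hz)
{
    if (sketch_pwrsave_allowed(clock_hz))
        return reg | SKETCH_CLK_PWRSAVE;
    return reg & ~SKETCH_CLK_PWRSAVE;
}
```

With this shape, the bit stays clear through the whole low-speed phase that contains the voltage switch, so the clock cannot be auto-gated during it.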
      
      CRs-Fixed: 589992
      Change-Id: If82d6d2e303b8d1189b76712e514f41fe6e2cf8b
      Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
      35afe159