1. 21 Feb, 2016 40 commits
    • Anton Vorontsov's avatar
      memcg: add memory.pressure_level events · d12c78e5
      Anton Vorontsov authored
      With this patch userland applications that want to maintain the
      interactivity/memory allocation cost can use the pressure level
      notifications.  The levels are defined like this:
      
      The "low" level means that the system is reclaiming memory for new
      allocations.  Monitoring this reclaiming activity might be useful for
      maintaining cache level.  Upon notification, the program (typically
      "Activity Manager") might analyze vmstat and act in advance (i.e.
      prematurely shutdown unimportant services).
      
      The "medium" level means that the system is experiencing medium memory
      pressure, the system might be making swap, paging out active file
      caches, etc.  Upon this event applications may decide to further analyze
      vmstat/zoneinfo/memcg or internal memory usage statistics and free any
      resources that can be easily reconstructed or re-read from a disk.
      
      The "critical" level means that the system is actively thrashing, it is
      about to out of memory (OOM) or even the in-kernel OOM killer is on its
      way to trigger.  Applications should do whatever they can to help the
      system.  It might be too late to consult with vmstat or any other
      statistics, so it's advisable to take an immediate action.
      
      The events are propagated upward until the event is handled, i.e.  the
      events are not pass-through.  Here is what this means: for example you
      have three cgroups: A->B->C.  Now you set up an event listener on
      cgroups A, B and C, and suppose group C experiences some pressure.  In
      this situation, only group C will receive the notification, i.e.  groups
      A and B will not receive it.  This is done to avoid excessive
      "broadcasting" of messages, which disturbs the system and which is
      especially bad if we are low on memory or thrashing.  So, organize the
      cgroups wisely, or propagate the events manually (or ask us to
      implement the pass-through events, explaining why you would need them).
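      
      For illustration, a minimal listener might register for the "low" level
      roughly like this (cgroup mount point, path and level are examples, and
      error handling is trimmed):
      
      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/eventfd.h>
      
      int main(void)
      {
      	uint64_t count;
      	char line[64];
      	int efd = eventfd(0, 0);
      	int lfd = open("/sys/fs/cgroup/memory/memory.pressure_level", O_RDONLY);
      	int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);
      
      	if (efd < 0 || lfd < 0 || cfd < 0)
      		return 1;
      
      	/* "<event fd> <pressure_level fd> <level>" registers the listener */
      	snprintf(line, sizeof(line), "%d %d low", efd, lfd);
      	if (write(cfd, line, strlen(line)) < 0)
      		return 1;
      
      	/* blocks until the kernel signals "low" pressure in this cgroup */
      	if (read(efd, &count, sizeof(count)) == sizeof(count))
      		printf("low memory pressure event received\n");
      	return 0;
      }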
      
      Performance-wise, the memory pressure notifications feature itself is
      lightweight and does not require much bookkeeping, in contrast to the
      rest of the memcg features.  Unfortunately, in the current memcg
      implementation, page accounting is an inseparable part and cannot be
      turned off.  The good news is that there are some efforts[1] to improve
      the situation; plus, implementing the same, fully API-compatible[2]
      interface for CONFIG_CGROUP_MEM_RES_CTLR=n case (e.g.  embedded) is also
      a viable option, so it will not require any changes on the userland
      side.
      
      [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
      [2] http://lkml.org/lkml/2013/2/21/454
      
      
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Change-Id: I4e703d3688c74466e02cf0f2b866e85043fe799d
      d12c78e5
    • Michael Wang's avatar
      slab: fix the DEADLOCK issue on l3 alien lock · 788a9073
      Michael Wang authored
      commit 947ca1856a7e60aa6d20536785e6a42dff25aa6e upstream.
      
      A DEADLOCK will be reported while running a kernel with NUMA and LOCKDEP enabled;
      the sequence that produces this false report is:
      
      	   kmem_cache_free()	//free obj in cachep
      	-> cache_free_alien()	//acquire cachep's l3 alien lock
      	-> __drain_alien_cache()
      	-> free_block()
      	-> slab_destroy()
      	-> kmem_cache_free()	//free slab in cachep->slabp_cache
      	-> cache_free_alien()	//acquire cachep->slabp_cache's l3 alien lock
      
      Since cachep's and cachep->slabp_cache's l3 alien locks are in the same lock class,
      a false report is generated.
      
      This should not happen since we already have init_lock_keys() which will
      reassign the lock class for both l3 list and l3 alien.
      
      However, init_lock_keys() was invoked at the wrong position: before
      enable_cpucache() is invoked on each cache.
      
      Until slab_state is set to FULL, enable_cpucache() is not invoked on
      caches to build their l3 alien structures when they are created; so
      although init_lock_keys() was invoked, the l3 alien lock classes were
      not changed, since those locks did not exist until enable_cpucache()
      ran later.
      
      This patch invokes init_lock_keys() after enable_cpucache() is done,
      instead of before, to avoid the false DEADLOCK report.
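      
      A simplified sketch of the new ordering in kmem_cache_init_late() (not the
      literal diff; the cache_chain locking and error paths are omitted):
      
      void __init kmem_cache_init_late(void)
      {
      	struct kmem_cache *cachep;
      
      	/* 6) resize the head arrays to their final sizes */
      	list_for_each_entry(cachep, &cache_chain, next)
      		if (enable_cpucache(cachep, GFP_NOWAIT))
      			BUG();
      
      	/*
      	 * Annotate the lock classes only after enable_cpucache() has built
      	 * the l3 alien caches; before this point there is nothing for
      	 * init_lock_keys() to reclassify.
      	 */
      	init_lock_keys();
      
      	/* Done! */
      	slab_state = FULL;
      }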
      
      Michael traced the problem back to a commit in release 3.0.0:
      
      commit 30765b92
      
      
      Author: Peter Zijlstra <peterz@infradead.org>
      Date:   Thu Jul 28 23:22:56 2011 +0200
      
          slab, lockdep: Annotate the locks before using them
      
          Fernando found we hit the regular OFF_SLAB 'recursion' before we
          annotate the locks, cure this.
      
          The relevant portion of the stack-trace:
      
          > [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
          > [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
          > [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
          > [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
          > [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
          > [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
          > [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
          > [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
          > [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
          > [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf
      Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
          Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
      
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      
      The commit moved init_lock_keys() before we build up the alien caches, so we
      failed to reclassify them.
      Acked-by: Christoph Lameter <cl@linux.com>
      Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      788a9073
    • Mahesh Sivasubramanian's avatar
      arm: arch_timer: set memory mapped timer interrupt as IRQF_TIMER · 8808c5cc
      Mahesh Sivasubramanian authored
      
      The memory mapped timer is used as a broadcast timer to wake the core for
      timer interrupts when the arch timer might not be functional. When the interrupt
      is not marked as IRQF_NO_SUSPEND, it gets disabled during the
      suspend_device_irqs() call in the suspend path. If a core were to enter an
      idle low-power mode which relies on the broadcast timer to process the interrupt,
      the core is never woken up for timer interrupts.
      
      Mark the interrupt with IRQF_TIMER, which identifies it as a timer
      interrupt and also implies IRQF_NO_SUSPEND.
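      
      Illustratively, at the point where the memory mapped timer interrupt is
      requested (handler and cookie names below are placeholders, not the exact
      driver code):
      
      	/*
      	 * IRQF_TIMER implies IRQF_NO_SUSPEND, so suspend_device_irqs()
      	 * leaves this interrupt enabled and the broadcast timer can still
      	 * wake the core out of its idle low power state.
      	 */
      	ret = request_irq(irq, arch_timer_handler_mem, IRQF_TIMER,
      			  "arch_mem_timer", arch_timer_evt);
      	if (ret)
      		pr_err("arch_timer: can't register interrupt %d (%d)\n",
      		       irq, ret);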
      
      CRs-fixed: 636712
      Change-Id: I0484e92a9d05f66a0c5b3c00c584a3dd3fe6ae85
      Signed-off-by: Mahesh Sivasubramanian <msivasub@codeaurora.org>
      8808c5cc
    • Syed Rameez Mustafa's avatar
      ARM: Allow panic on division by zero in the kernel · aaa03166
      Syed Rameez Mustafa authored
      
      Division by zero errors in the kernel currently trigger warnings.
      Allow panic on these errors so that we can catch the problem closer
      to its source.
      
      Change-Id: Id5fed71b74cd37874ae857a8105455d7561c782d
      Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
      aaa03166
    • Jiang Liu's avatar
      memory hotplug: fix invalid memory access caused by stale kswapd pointer · e7554260
      Jiang Liu authored
      
      commit d8adde17e5f858427504725218c56aef90e90fc7 upstream.
      
      kswapd_stop() is called to destroy the kswapd work thread when all memory
      of a NUMA node has been offlined.  But kswapd_stop() only terminates the
      work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
      pointer will prevent kswapd_run() from creating a new work thread when
      adding memory to the memory-less NUMA node again.  Eventually the stale
      pointer may cause invalid memory access.
      
      An example stack dump is below.  It was reproduced with 2.6.32, but the latest
      kernel has the same issue.
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
        PGD 0
        Oops: 0000 [#1] SMP
        last sysfs file: /sys/devices/system/memory/memory391/state
        CPU 11
        Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
        RIP: 0010:exit_creds+0x12/0x78
        RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
        RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
        RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
        R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
        R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
        FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
        Stack:
         ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
         ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
         0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
        Call Trace:
          __put_task_struct+0x5d/0x97
          kthread_stop+0x50/0x58
          offline_pages+0x324/0x3da
          memory_block_change_state+0x179/0x1db
          store_mem_state+0x9e/0xbb
          sysfs_write_file+0xd0/0x107
          vfs_write+0xad/0x169
          sys_write+0x45/0x6e
          system_call_fastpath+0x16/0x1b
        Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
        RIP  exit_creds+0x12/0x78
         RSP <ffff8806044f1d78>
        CR2: 0000000000000000
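      
      The fix clears the stale pointer when the per-node thread is stopped; a
      simplified sketch:
      
      /*
       * Called when all memory of a NUMA node has been offlined.  Clearing
       * NODE_DATA(nid)->kswapd lets a later kswapd_run() start a fresh thread
       * instead of reusing a pointer to a freed task_struct.
       */
      void kswapd_stop(int nid)
      {
      	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
      
      	if (kswapd) {
      		kthread_stop(kswapd);
      		NODE_DATA(nid)->kswapd = NULL;
      	}
      }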
      
      [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e7554260
    • Gavin Shan's avatar
      mm/memblock: fix memory leak on extending regions · defc6575
      Gavin Shan authored
      
      commit 181eb39425f2b9275afcb015eaa547d11f71a02f upstream.
      
      The overall memblock is organized into memory regions and
      reserved regions.  Initially, the memory regions and reserved regions are
      stored in predetermined arrays of "struct memblock_region".  It's
      possible for the arrays to be enlarged when we have newly added regions,
      but no free space is left there.  The policy here is to create a double-sized
      array, either by the slab allocator or the memblock allocator.  Unfortunately,
      we didn't free the old array, which might have been allocated through the
      slab allocator before.  That would cause a memory leak.
      
      The patch introduces two variables to track where (slab or memblock) the
      memory for the memory and reserved region arrays comes from.  That memory
      is deallocated by kfree() if it was allocated by the slab allocator,
      fixing the memory leak.
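      
      A simplified sketch of the idea (flag and helper names are illustrative,
      not the exact patch):
      
      /* remember how the current region arrays were allocated */
      static int memblock_memory_in_slab;	/* 1 if memory.regions was kmalloc'ed */
      static int memblock_reserved_in_slab;	/* 1 if reserved.regions was kmalloc'ed */
      
      static void memblock_free_old_array(struct memblock_region *old,
      				    unsigned long old_size, int in_slab)
      {
      	/* the initial static arrays must never be freed; callers skip them */
      	if (!old)
      		return;
      
      	if (in_slab)
      		kfree(old);				/* slab-allocated */
      	else
      		memblock_free(__pa(old), old_size);	/* memblock-allocated */
      }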
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      defc6575
    • Greg Pearson's avatar
      mm/memblock: fix overlapping allocation when doubling reserved array · ef674d27
      Greg Pearson authored
      
      commit 48c3b583bbddad2220ca4c22319ca5d1f78b2090 upstream.
      
      __alloc_memory_core_early() asks memblock for a range of memory then tries
      to reserve it.  If the reserved region array lacks space for the new
      range, memblock_double_array() is called to allocate more space for the
      array.  If memblock is used to allocate memory for the new array it can
      end up using a range that overlaps with the range originally allocated in
      __alloc_memory_core_early(), leading to possible data corruption.
      
      With this patch memblock_double_array() now calls memblock_find_in_range()
      with a narrowed candidate range (in cases where the reserved.regions array
      is being doubled) so any memory allocated will not overlap with the
      original range that was being reserved.  The range is narrowed by passing
      in the starting address and size of the previously allocated range.  Then
      the range above the ending address is searched and if a candidate is not
      found, the range below the starting address is searched.
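      
      Sketched, the narrowed search looks roughly like this (parameter names
      follow the description above; treat the fragment as illustrative):
      
      	/*
      	 * [new_area_start, new_area_start + new_area_size) is the range that
      	 * __alloc_memory_core_early() is in the middle of reserving.  Search
      	 * above it first, then below it, so the doubled array can never land
      	 * inside that range.
      	 */
      	addr = memblock_find_in_range(new_area_start + new_area_size,
      				      memblock.current_limit,
      				      new_alloc_size, PAGE_SIZE);
      	if (!addr && new_area_size)
      		addr = memblock_find_in_range(0,
      				min(new_area_start, memblock.current_limit),
      				new_alloc_size, PAGE_SIZE);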
      Signed-off-by: Greg Pearson <greg.pearson@hp.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ef674d27
    • Mel Gorman's avatar
      mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · 34967266
      Mel Gorman authored
      commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream.
      
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably introduced
      in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
      look like this:
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately it appears to be also insufficient. Consider the following
      situation
      
      Process A					Process B
      ---------					---------
      hugetlb_fault					shmdt
        						LockWrite(mmap_sem)
          						  do_munmap
      						    unmap_region
      						      unmap_vmas
      						        unmap_single_vma
      						          unmap_hugepage_range
            						            Lock(i_mmap_mutex)
      							    Lock(mm->page_table_lock)
      							    huge_pmd_unshare/unmap tables <--- (1)
      							    Unlock(mm->page_table_lock)
            						            Unlock(i_mmap_mutex)
        huge_pte_alloc				      ...
          Lock(i_mmap_mutex)				      ...
          vma_prio_walk, find svma, spte		      ...
          Lock(mm->page_table_lock)			      ...
          share spte					      ...
          Unlock(mm->page_table_lock)			      ...
          Unlock(i_mmap_mutex)			      ...
        hugetlb_no_page									  <--- (2)
      						      free_pgtables
      						        unlink_file_vma
      							hugetlb_free_pgd_range
      						    remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vulnerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
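      
      A simplified sketch of the new helper (signature and callers trimmed):
      
      void __unmap_hugepage_range_final(struct vm_area_struct *vma,
      				  unsigned long start, unsigned long end,
      				  struct page *ref_page)
      {
      	__unmap_hugepage_range(vma, start, end, ref_page);
      
      	/*
      	 * Clearing VM_MAYSHARE while the caller still holds i_mmap_mutex
      	 * makes the dying VMA fail the page table shareability test, so no
      	 * other process can latch onto its page tables between the unmap
      	 * above and free_pgtables().
      	 */
      	vma->vm_flags &= ~VM_MAYSHARE;
      }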
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
      ==== CUT HERE ====
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/ipc.h>
      #include <sys/shm.h>
      #include <sys/time.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      
      static size_t huge_page_size = (2UL << 20);
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could iterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget:");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork:");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      34967266
    • Xiao Guangrong's avatar
      mm: mmu_notifier: fix freed page still mapped in secondary MMU · 6d94102f
      Xiao Guangrong authored
      
      commit 3ad3d901bbcfb15a5e4690e55350db0899095a68 upstream.
      
      mmu_notifier_release() is called when the process is exiting.  It will
      delete all the mmu notifiers.  But at this time the page belonging to the
      process is still present in page tables and is present on the LRU list, so
      this race will happen:
      
            CPU 0                 CPU 1
      mmu_notifier_release:    try_to_unmap:
         hlist_del_init_rcu(&mn->hlist);
                                  ptep_clear_flush_notify:
                                              mmu notifier not found
                                  free page  !!!!!!
                                  /*
                                   * At the point, the page has been
                                   * freed, but it is still mapped in
                                   * the secondary MMU.
                                   */
      
        mn->ops->release(mn, mm);
      
      Then the box is not stable and sometimes we can get this bug:
      
      [  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
      [  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
      [  738.075936] page flags: 0x20000000000014(referenced|dirty)
      
      The same issue is present in mmu_notifier_unregister().
      
      We can call ->release before deleting the notifier to ensure the page has
      been unmapped from the secondary MMU before it is freed.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d94102f
    • Hugh Dickins's avatar
      mm: fix crashes from mbind() merging vmas · f14412af
      Hugh Dickins authored
      commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.
      
      In v2.6.34 commit 9d8cebd4 ("mm: fix mbind vma merge problem")
      introduced vma merging to mbind(), but it should have also changed the
      convention of passing start vma from queue_pages_range() (formerly
      check_range()) to new_vma_page(): vma merging may have already freed
      that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
      worse crashes.
      
      Fixes: 9d8cebd4 ("mm: fix mbind vma merge problem")
      Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f14412af
    • Xiaozhe Shi's avatar
      power: qpnp-bms: always limit soc to [0, 100] · 124e06c1
      Xiaozhe Shi authored
      
      Currently, soc is only limited to [0, 100] when adjust_soc runs.
      However, there are some cases where the main algorithm of adjust_soc is
      skipped, due to charging, SOC being too high or in the flat region of
      the PC/OCV curve.
      
      This can cause issues where SOC is calculated to be over 100 or under
      0, which is undesirable. Fix this by moving the bound_soc call to the
      main calculate_soc function so that it is never skipped.
      
      CRs-Fixed: 697713
      Change-Id: I641f513d182c62731a4fc115f29c0e38e5ec4c14
      Signed-off-by: Xiaozhe Shi <xiaozhes@codeaurora.org>
      124e06c1
    • Sujit Reddy Thumma's avatar
      mmc: sdhci-msm: Fix clock gating while voltage switch is in progress · 35afe159
      Sujit Reddy Thumma authored
      
      CLK_PWRSAVE bit in vendor specific register gates the output clock to
      card automatically if there are no data/cmd operations.
      
      According to the SD3.0 voltage switch sequence, the host should provide a
      clock to the card for at least one millisecond before the DAT[3:0] lines
      are pulled high by the card. In this case, if the power save bit is enabled,
      it might auto-gate clocks even before the card completes the voltage
      switch sequence.
      
      Fix this by disabling the power save operation when the clocks are turned
      off and enabling it only when the clock rate is >400 KHz, i.e., at the end of
      initialization.
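      
      Illustratively (the vendor specific register and bit names below stand in
      for the ones the commit describes; not the exact driver code):
      
      	u32 config = readl_relaxed(host->ioaddr + CORE_VENDOR_SPEC);
      
      	if (clk_enabled && ios->clock > 400000)
      		config |= CORE_CLK_PWRSAVE;	/* safe to auto-gate after init */
      	else
      		config &= ~CORE_CLK_PWRSAVE;	/* keep the clock running, e.g.
      						   during the voltage switch */
      	writel_relaxed(config, host->ioaddr + CORE_VENDOR_SPEC);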
      
      CRs-Fixed: 589992
      Change-Id: If82d6d2e303b8d1189b76712e514f41fe6e2cf8b
      Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
      35afe159
    • Krishna Konda's avatar
      mmc: core: get drive types supported by eMMC cards · 8a08a0e6
      Krishna Konda authored
      
      Get the various drive types, other than the default, supported
      by the card.
      
      Change-Id: I122971e4fb4a3ab98f0078ceafca3380e9c0e2d1
      Signed-off-by: Krishna Konda <kkonda@codeaurora.org>
      8a08a0e6
    • Venkat Gopalakrishnan's avatar
      mmc: core: Fix power class config for HS400 · c4ea5bef
      Venkat Gopalakrishnan authored
      
      Use the correct power class field from the extended CSD register
      for HS400 mode as defined in the eMMC5.0 specification.
      
      CRs-fixed: 690341
      Change-Id: Ie10e35941fd3c6ee49c686f721bf5af6fcd74862
      Signed-off-by: Venkat Gopalakrishnan <venkatg@codeaurora.org>
      c4ea5bef
    • Balamurugan Alagarsamy's avatar
      v4l2: vb2: replace VIDEO_MAX_FRAME with VB2_MAX_FRAME · 5145166e
      Balamurugan Alagarsamy authored
      
      - Make vb2 drivers rely on VB2_MAX_FRAME instead of VIDEO_MAX_FRAME.
      
      - VB2_MAX_FRAME bumps the value to 64 from the current 32.
      
      Change-Id: I3d7998898df43553486166c44b54524aac449deb
      Signed-off-by: Balamurugan Alagarsamy <balaga@codeaurora.org>
      5145166e
    • Matt Wagantall's avatar
      cpaccess: remove use of set_get_l2_indirect_reg() · f92d2bb2
      Matt Wagantall authored
      
      cpaccess is the only client of set_get_l2_indirect_reg(), a special-
      purpose API for writing an indirect Krait CP15 register and reading
      back the new value. For simplicity, this API is being removed and
      replaced with bacl-to-back calls to set_l2_indirect_reg() and
      get_l2_indirect_reg(), which should perform comparably.
      
      Change-Id: I868ae115265c59e58ffd37dc405a91c6962e0c3d
      Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
      f92d2bb2
    • Rafael Aquini's avatar
      swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion · 4a9432dc
      Rafael Aquini authored
      
      commit cbab0e4eec299e9059199ebe6daf48730be46d2b upstream.
      
      read_swap_cache_async() can race against get_swap_page(), and stumble
      across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
      into the swapcache yet.
      
      This swap_map state is expected to be transitory, but the
      actual placement of discard at scan_swap_map() inserts a wait for I/O
      completion, thus making the thread at read_swap_cache_async() loop
      around its -EEXIST case, while the other end at get_swap_page() is
      scheduled away at scan_swap_map().  This can leave the system deadlocked
      if the I/O completion happens to be waiting on the CPU waitqueue where
      read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
      
      This patch introduces a cond_resched() call to make the aforementioned
      read_swap_cache_async() busy loop bail out when necessary,
      thus avoiding the subtle race window.
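      
      Roughly, the retry loop becomes the following (simplified; only the
      -EEXIST branch changes):
      
      	for (;;) {
      		/* ... allocate a page and look up the swap cache ... */
      
      		err = swapcache_prepare(entry);
      		if (err == -EEXIST) {
      			/*
      			 * get_swap_page() set SWAP_HAS_CACHE but may be
      			 * parked in scan_swap_map() waiting on discard I/O;
      			 * yield so it can finish instead of spinning here
      			 * forever on !CONFIG_PREEMPT.
      			 */
      			cond_resched();
      			continue;
      		}
      		if (err)		/* swp entry is obsolete ? */
      			break;
      
      		/* ... add the new page to the swap cache and read it in ... */
      	}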
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a9432dc
    • Mathieu Desnoyers's avatar
      Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys · a10059bc
      Mathieu Desnoyers authored
      
      commit 8aec0f5d4137532de14e6554fd5dd201ff3a3c49 upstream.
      
      Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
      compat_process_vm_rw() shows that the compatibility code requires an
      explicit "access_ok()" check before calling
      compat_rw_copy_check_uvector(). The same difference seems to appear when
      we compare fs/read_write.c:do_readv_writev() to
      fs/compat.c:compat_do_readv_writev().
      
      This subtle difference between the compat and non-compat requirements
      should probably be debated, as it seems to be error-prone. In fact,
      there are two others sites that use this function in the Linux kernel,
      and they both seem to get it wrong:
      
      Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
      also ends up calling compat_rw_copy_check_uvector() through
      aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
      be missing. Same situation for
      security/keys/compat.c:compat_keyctl_instantiate_key_iov().
      
      I propose that we add the access_ok() check directly into
      compat_rw_copy_check_uvector(), so callers don't have to worry about it,
      and it therefore makes the compat call code similar to its non-compat
      counterpart. Place the access_ok() check in the same location where
      copy_from_user() can trigger a -EFAULT error in the non-compat code, so
      the ABI behaviors are alike on both compat and non-compat.
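      
      Concretely, a check along these lines moves into the shared helper
      (fragment; in this era access_ok() still took a VERIFY_READ argument):
      
      	/* inside compat_rw_copy_check_uvector(), before the copy loop */
      	ret = -EFAULT;
      	if (!access_ok(VERIFY_READ, uvector, nr_segs * sizeof(*uvector)))
      		goto out;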
      
      While we are here, fix compat_do_readv_writev() so it checks for
      compat_rw_copy_check_uvector() negative return values.
      
      And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
      handling.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a10059bc
    • Michal Hocko's avatar
      mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT · 368b15bb
      Michal Hocko authored
      commit 53a59fc67f97374758e63a9c785891ec62324c81 upstream.
      
      Since commit e303297e ("mm: extended batches for generic
      mmu_gather") we are batching pages to be freed until either
      tlb_next_batch cannot allocate a new batch or we are done.
      
      This works just fine most of the time but we can get in troubles with
      non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
      on large machines where too aggressive batching might lead to soft
      lockups during process exit path (exit_mmap) because there are no
      scheduling points down the free_pages_and_swap_cache path and so the
      freeing can take long enough to trigger the soft lockup.
      
      The lockup is harmless except when the system is setup to panic on
      softlockup which is not that unusual.
      
      The simplest way to work around this issue is to limit the maximum
      number of batches in a single mmu_gather.  10k of collected pages should
      be safe to prevent from soft lockups (we would have 2ms for one) even if
      they are all freed without an explicit scheduling point.
      
      This patch doesn't add any new explicit scheduling points because it
      relies on zap_pmd_range during page tables zapping which calls
      cond_resched per PMD.
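      
      A sketch of the capped batching in tlb_next_batch() (constants and fields
      named after the description above; details simplified):
      
      #define MAX_GATHER_BATCH_COUNT	(10000UL / MAX_GATHER_BATCH)
      
      static int tlb_next_batch(struct mmu_gather *tlb)
      {
      	struct mmu_gather_batch *batch;
      
      	batch = tlb->active;
      	if (batch->next) {
      		tlb->active = batch->next;
      		return 1;
      	}
      
      	/* cap reached: flush what we have instead of growing the list */
      	if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
      		return 0;
      
      	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
      	if (!batch)
      		return 0;
      
      	tlb->batch_count++;
      	batch->next = NULL;
      	batch->nr = 0;
      	batch->max = MAX_GATHER_BATCH;
      
      	tlb->active->next = batch;
      	tlb->active = batch;
      
      	return 1;
      }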
      
      The following lockup has been reported for a 3.0 kernel with a huge
      process (on the order of hundreds of gigs, but I do not know any more details).
      
        BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
        Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
        Supported: Yes
        CPU 56
        Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
        RIP: 0010:  _raw_spin_unlock_irqrestore+0x8/0x10
        RSP: 0018:ffff883ec1037af0  EFLAGS: 00000206
        RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
        RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
        RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
        R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
        R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
        FS:  00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
        Call Trace:
          release_pages+0xc5/0x260
          free_pages_and_swap_cache+0x9d/0xc0
          tlb_flush_mmu+0x5c/0x80
          tlb_finish_mmu+0xe/0x50
          exit_mmap+0xbd/0x120
          mmput+0x49/0x120
          exit_mm+0x122/0x160
          do_exit+0x17a/0x430
          do_group_exit+0x3d/0xb0
          get_signal_to_deliver+0x247/0x480
          do_signal+0x71/0x1b0
          do_notify_resume+0x98/0xb0
          int_signal+0x12/0x17
        DWARF2 unwinder stuck at int_signal+0x12/0x17
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      368b15bb
    • Hugh Dickins's avatar
      tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking · 6cd2d508
      Hugh Dickins authored
      
      commit 35c2a7f4908d404c9124c2efc6ada4640ca4d5d5 upstream.
      
      Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
      	u64 inum = fid->raw[2];
      which is unhelpfully reported as at the end of shmem_alloc_inode():
      
      BUG: unable to handle kernel paging request at ffff880061cd3000
      IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Call Trace:
       [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
       [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
       [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
       [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
      
      Right, tmpfs is being stupid to access fid->raw[2] before validating that
      fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
      fall at the end of a page, and the next page not be present.
      
      But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
      careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
      could oops in the same way: add the missing fh_len checks to those.
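      
      The tmpfs handler illustrates the kind of guard being added (sketch; the
      other filesystems get equivalent checks in their fh_to_dentry() and
      fh_to_parent() handlers):
      
      static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
      		struct fid *fid, int fh_len, int fh_type)
      {
      	struct inode *inode;
      	struct dentry *dentry = NULL;
      	u64 inum;
      
      	if (fh_len < 3)		/* raw[0..2] must actually be present */
      		return NULL;
      
      	inum = fid->raw[2];
      	inum = (inum << 32) | fid->raw[1];
      
      	inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]),
      			 shmem_match, fid->raw);
      	if (inode) {
      		dentry = d_find_alias(inode);
      		iput(inode);
      	}
      
      	return dentry;
      }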
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6cd2d508
    • Vinson Lee's avatar
      perf tools: Fix build with bison 2.3 and older. · 51de4318
      Vinson Lee authored
      
      commit 85df3b3769222894e9692b383c7af124b7721086 upstream.
      
      The %name-prefix "prefix" syntax is not available on bison 2.3 and
      older. Substitute with the -p "prefix" command-line option for
      compatibility with older versions of bison.
      
      This patch fixes this build error with older versions of bison.
      
          CC util/sysfs.o
          BISON util/pmu-bison.c
      util/pmu.y:2.14-24: syntax error, unexpected string, expecting =
      make: *** [util/pmu-bison.c] Error 1
      Signed-off-by: Vinson Lee <vlee@twitter.com>
      Tested-by: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Link: http://lkml.kernel.org/r/1360792138-29186-1-git-send-email-vlee@twitter.com
      
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      51de4318
    • Nikola Majkić's avatar
      Revert "msm: kgsl: don't use sscanf()" · bd8d594b
      Nikola Majkić authored
      This reverts commit 3f9fee8a.
      bd8d594b
    • Nikola Majkić's avatar
      Revert "msm: kgsl: Keep track of kernel space mappings to memory" · ebd7b7a7
      Nikola Majkić authored
      This reverts commit afc733137089159e86e3bcd4fb2fb5aab66726ce.
      ebd7b7a7
    • Nikola Majkić's avatar
      Revert "Revert "msm: kgsl: Keep track of kernel space mappings to memory"" · 15f27772
      Nikola Majkić authored
      This reverts commit 2eeca7c64aac4e9271d689649a9b85a82a02fc38.
      15f27772
    • Krishna Chaitanya Parimi's avatar
      msm: mdss: Correct RGB order for LUT programming in mdp3 · 1080688e
      Krishna Chaitanya Parimi authored
      
      MDP3 LUT programming has incorrect RGB order as per HW. The
      correct order is to have color0 for green, color1 for red
      and color2 for blue. Correcting the order in mdp3.
      
      Change-Id: Ie7b6ab7f83e18495e83a05102e288fee6841e3ab
      Signed-off-by: Krishna Chaitanya Parimi <cparimi@codeaurora.org>
      1080688e
    • Sandeep Panda's avatar
      msm: mdss: DSI read support for more than 2 bytes · 2dcfe50a
      Sandeep Panda authored
      
      Support more than 2 bytes DSI read for DSI v2 driver.
      
      Change-Id: Ie1b1a7990aed6944036ce82cf4202472604e8e87
      Signed-off-by: Sandeep Panda <spanda@codeaurora.org>
      2dcfe50a
    • Jed Davis's avatar
      ARM: 7765/1: perf: Record the user-mode PC in the call chain. · 0e453c7e
      Jed Davis authored
      
      commit c5f927a6f62196226915f12194c9d0df4e2210d7 upstream.
      
      With this change, we no longer lose the innermost entry in the user-mode
      part of the call chain.  See also the x86 port, which includes the ip.
      
      It's possible to partially work around this problem by post-processing
      the data to use the PERF_SAMPLE_IP value, but this works only if the CPU
      wasn't in the kernel when the sample was taken.
      Signed-off-by: Jed Davis <jld@mozilla.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e453c7e
    • qiuxishi's avatar
      memory hotplug: fix section info double registration bug · 04863a58
      qiuxishi authored
      
      commit f14851af0ebb32745c6c5a2e400aa0549f9d20df upstream.
      
      There may be a bug when registering section info.  For example, on my
      Itanium platform, the pfn range of node0 includes the other nodes, so
      other nodes' section info will be double registered, and the memmap's page
      count will equal 3.
      
        node0: start_pfn=0x100,    spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
        node1: start_pfn=0x80000,  spanned_pfn=0x80000,  present_pfn=0x80000, => 0x080000-0x100000
        node2: start_pfn=0x100000, spanned_pfn=0x80000,  present_pfn=0x80000, => 0x100000-0x180000
        node3: start_pfn=0x180000, spanned_pfn=0x80000,  present_pfn=0x80000, => 0x180000-0x200000
      
        free_all_bootmem_node()
      	register_page_bootmem_info_node()
      		register_page_bootmem_info_section()
      
      When hot-removing memory, we can't free the memmap's pages because
      page_count() is 2 after put_page_bootmem().
      
        sparse_remove_one_section()
      	free_section_usemap()
      		free_map_bootmem()
      			put_page_bootmem()
      
      [akpm@linux-foundation.org: add code comment]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      04863a58
    • Gavin Shan's avatar
      mm/memblock: cleanup on duplicate VA/PA conversion · 1faabad1
      Gavin Shan authored
      
      commit 4e2f07750d9a94e8f23e86408df5ab95be88bf11 upstream.
      
      The overall memblock is organized into memory regions and
      reserved regions.  Initially, the memory regions and reserved regions are
      stored in predetermined arrays of "struct memblock_region".  It's
      possible for the arrays to be enlarged when we have newly added regions
      for them, but not enough space there.  In that situation, we create a
      double-sized array to meet the requirement.  However, the original
      implementation converted the VA (Virtual Address) of the newly allocated
      array of regions to a PA (Physical Address), then translated it back when
      the new array was allocated from slab.  That's actually unnecessary.
      
      The patch removes the duplicate VA/PA conversion.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1faabad1
    • Greg Thelen's avatar
      memcg: fix multiple large threshold notifications · 7e24b212
      Greg Thelen authored
      commit 2bff24a3707093c435ab3241c47dcdb5f16e432b upstream.
      
      A memory cgroup with (1) multiple threshold notifications and (2) at least
      one threshold >=2G was not reliable.  Specifically the notifications would
      either not fire or would not fire in the proper order.
      
      The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
      thresholds in sorted order.  mem_cgroup_usage_register_event() sorts them
      with compare_thresholds(), which returns the difference of two 64 bit
      thresholds as an int.  If the difference is positive but has bit[31] set,
      then sort() treats the difference as negative and breaks sort order.
      
      This fix compares the two arbitrary 64 bit thresholds returning the
      classic -1, 0, 1 result.
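      
      In other words, the comparator changes along these lines (struct and field
      names follow the memcg threshold code):
      
      /* broken: the u64 difference is truncated to int, so a large positive
       * difference with bit[31] set looks negative to sort() */
      static int compare_thresholds_broken(const void *a, const void *b)
      {
      	const struct mem_cgroup_threshold *_a = a;
      	const struct mem_cgroup_threshold *_b = b;
      
      	return _a->threshold - _b->threshold;
      }
      
      /* fixed: explicit -1, 0, 1 comparison, correct for any 64 bit thresholds */
      static int compare_thresholds(const void *a, const void *b)
      {
      	const struct mem_cgroup_threshold *_a = a;
      	const struct mem_cgroup_threshold *_b = b;
      
      	if (_a->threshold > _b->threshold)
      		return 1;
      	if (_a->threshold < _b->threshold)
      		return -1;
      	return 0;
      }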
      
      The test below sets two notifications (at 0x1000 and 0x81001000):
        cd /sys/fs/cgroup/memory
        mkdir x
        for x in 4096 2164264960; do
          cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
        done
        echo $$ > x/cgroup.procs
        anon_leaker 500M
      
      v3.11-rc7 fails to signal the 4096 event listener:
        Leaking...
        Done leaking pages.
      
      Patched v3.11-rc7 properly notifies:
        Leaking...
        4096 listener:2013:8:31:14:13:36
        Done leaking pages.
      
      The fixed bug is old.  It appears to date back to the introduction of
      memcg threshold notifications in v2.6.34-rc1-116-g2e72b634 ("memcg:
      implement memory thresholds").
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e24b212
    • Jordan Crouse's avatar
      sync: Only print details for active fences · b503bcf6
      Jordan Crouse authored
      
      Only call the pt_log callback for active fences.
      
      CRs-Fixed: 744197
      Change-Id: Ic0dedbadf0f5979fc155cdef332fedda9047f440
      Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
      Signed-off-by: Lynus Vaz <lvaz@codeaurora.org>
      (cherry picked from commit c1ea464ed28baa6e3b896657e869bdfdb9eb9c77)
      Reviewed-on: http://gerrit.mot.com/726707
      
      
      SLTApproved: Slta Waiver <sltawvr@motorola.com>
      SME-Granted: SME Approvals Granted
      Submit-Approved: Jira Key <jirakey@motorola.com>
      Tested-by: Jira Key <jirakey@motorola.com>
      Reviewed-by: Fred Fettinger <fettinge@motorola.com>
      Reviewed-by: Stephen Rossbach <rossbach@motorola.com>
      b503bcf6
    • Jordan Crouse's avatar
      sync: Add a "details" callback for sync points · 280e3ceb
      Jordan Crouse authored
      
      Allow drivers to add a callback for expanded details about a sync
      point.  This provides for a much richer debug experience than can
      be provided by the simpler callbacks.
      
      Change-Id: Ic0dedbad19fddc2f9b753d886994247e8025d6dc
      Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
      Signed-off-by: Lynus Vaz <lvaz@codeaurora.org>
      (cherry picked from commit 64d4572c1e8f97c55edcc140c441d221a1e33ce4)
      Reviewed-on: http://gerrit.mot.com/726706
      
      
      SLTApproved: Slta Waiver <sltawvr@motorola.com>
      SME-Granted: SME Approvals Granted
      Submit-Approved: Jira Key <jirakey@motorola.com>
      Tested-by: Jira Key <jirakey@motorola.com>
      Reviewed-by: Fred Fettinger <fettinge@motorola.com>
      Reviewed-by: Stephen Rossbach <rossbach@motorola.com>
      280e3ceb
    • Cody Ferber's avatar
      Fix uninitialized div_s64 for gcc 4.9 · e8ca7efa
      Cody Ferber authored
      I know we aren't even close to using gcc 4.9, so this might not get merged
      for a while, but here it is for when the time comes. :)  4.9 is even more finicky.
      
      Missing include.
      
      Change-Id: Ia1a0b675f37623735bc1c041fbdd0be61cdaa427
      e8ca7efa
    • Abhijeet Dharmapurikar's avatar
      msm: krait-regulator: fix unnecessary calls to switch to LDO · b935a40f
      Abhijeet Dharmapurikar authored
      
      commit 7eba2ac5de2b5a80898b7dbb42d8f0c0ec461c00 upstream.
      
      When asked to set a LDO voltage the driver rounds it up to
      the nearest step value. However, when checking if the requested voltage
      is already set it doesn't account for the rounding up.
      
      Fix this by rounding up the requested value, which avoids
      unnecessary calls to set the LDO voltage and switch to LDO.
      
      CRs-Fixed: 609879
      Change-Id: Ia5a201721e39eece49dea3f61fbc585f116d5060
      Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
      Reviewed-on: http://gerrit.mot.com/680993
      
      
      SLTApproved: Slta Waiver <sltawvr@motorola.com>
      Submit-Approved: Jira Key <jirakey@motorola.com>
      Tested-by: Jira Key <jirakey@motorola.com>
      Reviewed-by: Christopher Fries <cfries@motorola.com>
      b935a40f
    • Will Deacon's avatar
      ARM: 7488/1: mm: use 5 bits for swapfile type encoding · f6295542
      Will Deacon authored
      
      commit f5f2025ef3e2cdb593707cbf87378761f17befbe upstream.
      
      Page migration encodes the pfn in the offset field of a swp_entry_t.
      For LPAE, we support physical addresses of up to 36 bits (due to
      sparsemem limitations with the size of page flags), requiring 24 bits
      to represent a pfn. A further 3 bits are used to encode a swp_entry into
      a pte, leaving 5 bits for the type field. Furthermore, the core code
      defines MAX_SWAPFILES_SHIFT as 5, so the additional type bit does not
      get used.
      
      This patch reduces the width of the type field to 5 bits, allowing us
      to create up to 31 swapfiles of 64GB each.
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6295542
    • Dan Carpenter's avatar
      fanotify: info leak in copy_event_to_user() · 970d6f19
      Dan Carpenter authored
      
      commit de1e0c40aceb9d5bff09c3a3b97b2f1b178af53f upstream.
      
      The ->reserved field isn't cleared so we leak one byte of stack
      information to userspace.
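      
      A sketch of the fix in the metadata fill path (helper shape follows the
      fanotify read path of this era; the point is simply that the padding byte
      gets an explicit value before being copied to userspace):
      
      static void fill_event_metadata(struct fsnotify_group *group,
      				struct fanotify_event_metadata *metadata,
      				struct fsnotify_event *event)
      {
      	metadata->event_len = FAN_EVENT_METADATA_LEN;
      	metadata->metadata_len = 0;
      	metadata->vers = FANOTIFY_METADATA_VERSION;
      	metadata->reserved = 0;		/* was left uninitialized */
      	metadata->mask = event->mask;
      	metadata->pid = pid_vnr(event->tgid);
      	metadata->fd = create_fd(group, event);
      }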
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis Henriques <luis.henriques@canonical.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      970d6f19
    • Anurup m's avatar
      fs/fscache/stats.c: fix memory leak · c93c338a
      Anurup m authored
      commit ec686c9239b4d472052a271c505d04dae84214cc upstream.
      
      There is a kernel memory leak observed when the proc file
      /proc/fs/fscache/stats is read.
      
      The reason is that in fscache_stats_open, single_open is called and the
      respective release function is not called during release.  Hence fix
      with correct release function - single_release().
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=57101
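      
      The shape of the fix (fragment of the proc file_operations; the open
      routine using single_open() is unchanged):
      
      static int fscache_stats_open(struct inode *inode, struct file *file)
      {
      	return single_open(file, fscache_stats_show, NULL);
      }
      
      const struct file_operations fscache_stats_fops = {
      	.owner   = THIS_MODULE,
      	.open    = fscache_stats_open,
      	.read    = seq_read,
      	.llseek  = seq_lseek,
      	.release = single_release,	/* was seq_release: leaked the buffer
      					   allocated by single_open() */
      };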
      
      Signed-off-by: Anurup m <anurup.m@huawei.com>
      Cc: shyju pv <shyju.pv@huawei.com>
      Cc: Sanil kumar <sanil.kumar@huawei.com>
      Cc: Nataraj m <nataraj.m@huawei.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c93c338a
    • majianpeng's avatar
      nfsd: Fix memleak · 5c324093
      majianpeng authored
      
      commit 2d32b29a1c2830f7c42caa8258c714acd983961f upstream.
      
      When freeing an nfs client, we must also free its ->cl_stateids.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c324093
    • Xu Kai's avatar
      leds: leds-qpnp: use the proper pwm period in LPG mode · 005ebc5b
      Xu Kai authored
      
      Previously, when the leds were working in LPG mode, the pwm
      period was always set to the minimum value supported by hardware.
      That's unreasonable. The better way is to set the period to the
      expected value.
      
      CRs-Fixed: 655566
      Change-Id: I30b17cbfe98639ec132484f314cb9b9234e74354
      Signed-off-by: Xu Kai <kaixu@codeaurora.org>
      005ebc5b