1. 02 Apr, 2009 15 commits
    • preadv/pwritev: Add preadv and pwritev system calls. · f3554f4b
      Gerd Hoffmann authored
      This patch adds preadv and pwritev system calls.  These syscalls are a
      pretty straightforward combination of pread and readv (same for write).
      They are quite useful for doing vectored I/O in threaded applications.
      Using lseek+readv instead opens race windows you'll have to plug with
      locking.
      
      Other systems have such system calls too, NetBSD for example; see
      http://www.daemon-systems.org/man/preadv.2.html
      
      The application-visible interface provided by glibc should look like
      this, to be compatible with the existing implementations in the *BSD family:
      
        ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
        ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
      
      This prototype has one problem though: on 32-bit archs the (64-bit)
      offset argument is unaligned, which the syscall ABI of several archs
      doesn't allow.  At least s390 needs a wrapper in glibc to handle this.  As
      we'll need wrappers in glibc anyway, I've decided to push the problem to
      glibc entirely and use a syscall prototype which works without
      arch-specific wrappers inside the kernel: the offset argument is
      explicitly split into two 32-bit values.
      
      The patch contains the actual system call implementation and the wiring
      into the x86 system call tables.  Other archs follow as separate patches.
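      The offset split described above can be sketched in plain C.  This is
      only an illustration: the helper names (offset_high, offset_low,
      offset_join, check_round_trip) are hypothetical and are not part of the
      patch or of glibc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: upper 32 bits of a 64-bit file offset. */
uint32_t offset_high(int64_t offset)
{
    return (uint32_t)(((uint64_t)offset >> 32) & 0xffffffffu);
}

/* Hypothetical helper: lower 32 bits of a 64-bit file offset. */
uint32_t offset_low(int64_t offset)
{
    return (uint32_t)((uint64_t)offset & 0xffffffffu);
}

/* Hypothetical helper: rebuild the 64-bit offset from its two halves,
 * as the receiving side of such a call would have to do. */
int64_t offset_join(uint32_t high, uint32_t low)
{
    return (int64_t)(((uint64_t)high << 32) | low);
}

/* Round-trip check: joining the two halves must give back the offset. */
void check_round_trip(int64_t offset)
{
    assert(offset_join(offset_high(offset), offset_low(offset)) == offset);
}
```

      Casting to uint64_t before shifting keeps the shift well defined even
      when the caller's offset type is narrower than 64 bits.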
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • preadv/pwritev: create compat_writev() · 6949a631
      Gerd Hoffmann authored
      
      Factor out some code from compat_sys_writev() which can be shared with the
      upcoming compat_sys_pwritev().
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • preadv/pwritev: create compat_readv() · dac12138
      Gerd Hoffmann authored
      
      This patch series:
      
      Implement the preadv() and pwritev() syscalls.  *BSD has had these
      syscalls for quite some time.
      
      Test code:
      
      #if 0
      set -x
      gcc -Wall -O2 -o preadv $0
      exit 0
      #endif
      /*
       * preadv demo / test
       *
       * (c) 2008 Gerd Hoffmann <kraxel@redhat.com>
       *
       * build with "sh $thisfile"
       */
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <errno.h>
      #include <inttypes.h>
      #include <sys/uio.h>
      
      /* ----------------------------------------------------------------- */
      /* syscall windup                                                    */
      
      #include <sys/syscall.h>
      #if 0
      /* WARNING: Be sure you know what you are doing if you enable this.
       * linux syscall code isn't upstream yet, syscall numbers are subject
       * to change */
      # ifndef __NR_preadv
      #  ifdef __i386__
      #   define __NR_preadv  333
      #   define __NR_pwritev 334
      #  endif
      #  ifdef __x86_64__
      #   define __NR_preadv  295
      #   define __NR_pwritev 296
      #  endif
      # endif
      #endif
      #ifndef __NR_preadv
      # error preadv/pwritev syscall numbers are unknown
      #endif
      
      static ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
      {
          uint32_t pos_high = ((uint64_t)offset >> 32) & 0xffffffff;
          uint32_t pos_low  =  offset        & 0xffffffff;
      
          return syscall(__NR_preadv, fd, iov, iovcnt, pos_high, pos_low);
      }
      
      static ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
      {
          uint32_t pos_high = ((uint64_t)offset >> 32) & 0xffffffff;
          uint32_t pos_low  =  offset        & 0xffffffff;
      
          return syscall(__NR_pwritev, fd, iov, iovcnt, pos_high, pos_low);
      }
      
      /* ----------------------------------------------------------------- */
      /* demo/test app                                                     */
      
      static char filename[] = "/tmp/preadv-XXXXXX";
      static char outbuf[11] = "0123456789";
      static char inbuf[11]  = "----------";
      
      static struct iovec ovec[2] = {{
              .iov_base = outbuf + 5,
              .iov_len  = 5,
          },{
              .iov_base = outbuf + 0,
              .iov_len  = 5,
          }};
      
      static struct iovec ivec[3] = {{
              .iov_base = inbuf + 6,
              .iov_len  = 2,
          },{
              .iov_base = inbuf + 4,
              .iov_len  = 2,
          },{
              .iov_base = inbuf + 2,
              .iov_len  = 2,
          }};
      
      void cleanup(void)
      {
          unlink(filename);
      }
      
      int main(int argc, char **argv)
      {
          int fd, rc;
      
          fd = mkstemp(filename);
          if (-1 == fd) {
              perror("mkstemp");
              exit(1);
          }
          atexit(cleanup);
      
          /* write to file: "56789-01234" */
          rc = pwritev(fd, ovec, 2, 0);
          if (rc < 0) {
              perror("pwritev");
              exit(1);
          }
      
          /* read from file: "78-90-12" */
          rc = preadv(fd, ivec, 3, 2);
          if (rc < 0) {
              perror("preadv");
              exit(1);
          }
      
          printf("result  : %s\n", inbuf);
          printf("expected: %s\n", "--129078--");
          exit(0);
      }
      
      This patch:
      
      Factor out some code from compat_sys_readv() which can be shared with the
      upcoming compat_sys_preadv().
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cramfs: propagate uncompression errors · 98310e58
      David VomLehn authored
      
      Decompression errors can arise due to corruption of compressed blocks on
      flash or in memory.  This patch propagates errors detected during
      decompression back to the block layer.
      Signed-off-by: David VomLehn <dvomlehn@cisco.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bin_elf_fdpic: check the return value of clear_user · ab4ad555
      Mike Frysinger authored
      Signed-off-by: Mike Frysinger <vapier.adi@gmail.com>
      Signed-off-by: Bryan Wu <cooloney@kernel.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hppfs: hppfs_read_file() may return -ERROR · 880fe76e
      Roel Kluin authored
      
      hppfs_read_file() may return (ssize_t) -ENOMEM or -EFAULT.  When stored
      in the size_t 'count', these errors go unnoticed and a large value is
      added to *ppos.
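      A minimal userspace sketch of this class of bug follows.  It is not the
      hppfs code: failing_read(), read_count_buggy() and read_count_fixed()
      are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical stand-in for a read helper that fails with -ENOMEM. */
ssize_t failing_read(void)
{
    return -ENOMEM;
}

/* Buggy pattern: the negative error is stored in an unsigned size_t,
 * so it wraps to a huge value and a "count < 0" test can never fire. */
size_t read_count_buggy(void)
{
    size_t count = (size_t)failing_read();
    return count;
}

/* Fixed pattern: keep the signed type, test for an error first, and
 * only then advance the file position. */
ssize_t read_count_fixed(long long *ppos)
{
    ssize_t count = failing_read();

    assert(ppos != NULL);
    if (count < 0)
        return count;       /* propagate the error, leave *ppos alone */
    *ppos += count;
    return count;
}
```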
      Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext3: avoid false EIO errors · 695f6ae0
      Jan Kara authored
      Sometimes block_write_begin() can map buffers in a page but later we
      fail to copy data into those buffers (because the source page has been
      paged out in the meantime).  We then end up with !uptodate mapped
      buffers.  To add a bit more to the confusion, block_write_end() does
      not commit any data (and thus does not mark any buffers as uptodate) if
      we didn't succeed with copying all the data.
      
      Commit f4fc66a8 (ext3: convert to new aops) missed these cases and thus
      we were inserting non-uptodate buffers into the transaction's list, which
      confuses the JBD code: it reports IO errors, aborts the transaction and
      generally makes users afraid about their data ;-P.
      
      This patch fixes the problem by reorganizing the ext3_..._write_end()
      code to first call block_write_end() to mark buffers with valid data
      uptodate; after that we file only uptodate buffers to the transaction's
      lists.
      
      We also fix a problem where we could leave blocks allocated beyond i_size
      (i_disksize in fact) because of a failed write.  We now add the inode to
      the orphan list when a write fails (to be safe in case we crash) and then
      truncate blocks beyond i_size in a separate transaction.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext3: return -EIO not -ESTALE on directory traversal through deleted inode · de18f3b2
      Bryan Donlan authored
      
      ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to
      report errors to NFS properly.  However, in ext[234]_lookup(), this
      -ESTALE can be propagated to userspace if the filesystem is corrupted such
      that a directory entry references a deleted inode.  This leads to a
      misleading error message - "Stale NFS file handle" - and confusion on the
      part of the admin.
      
      The bug can be easily reproduced by creating a new filesystem, making a
      link to an unused inode using debugfs, then mounting and attempting to ls
      -l said link.
      
      This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE
      from ext3_iget(), as ext3 does for other filesystem metadata corruption;
      and also invokes the appropriate ext*_error functions when this case is
      detected.
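      The error translation described above amounts to something like the
      following sketch.  The function name lookup_error_for_userspace() is
      hypothetical, and the real patch additionally reports the corruption
      through the ext*_error functions, which is not modeled here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: translate the error from an inode fetch that was
 * reached through a directory entry.  -ESTALE only makes sense for NFS
 * file handles; seen from a directory lookup it indicates on-disk
 * corruption, so it is reported as -EIO instead. */
int lookup_error_for_userspace(int iget_err)
{
    assert(iget_err <= 0);      /* 0 or a negative errno value */
    if (iget_err == -ESTALE)
        return -EIO;
    return iget_err;
}
```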
      Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext3: use unsigned instead of int for type of blocksize in fs/ext3/namei.c · 45f90217
      Wei Yongjun authored
      
      Use unsigned instead of int for the parameter which carries a blocksize.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jbd: fix oops in jbd_journal_init_inode() on corrupted fs · ecca9af0
      Jan Kara authored
      
      On a 32-bit system with CONFIG_LBD, getblk can fail because the provided
      block number is too big.  Make JBD handle that gracefully.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <dmaciejak@fortinet.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ext3: remove the BKL in ext3/ioctl.c · 039fd8ce
      Cyrus Massoumi authored
      
      Reformat ext3/ioctl.c to make it look more like ext4/ioctl.c and remove
      the BKL around ext3_ioctl().
      Signed-off-by: Cyrus Massoumi <cyrusm@gmx.net>
      Cc: <linux-ext4@vger.kernel.org>
      Acked-by: Jan Kara <jack@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: check bh->b_blocknr only if BH_Mapped is set · 97f76d3d
      Nikanth Karthikesan authored
      
      Check bh->b_blocknr only if BH_Mapped is set.
      
      akpm: I doubt if b_blocknr is ever uninitialised here, but it could
      conceivably cause a problem if we're doing a lookup for block zero.
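      The block-zero concern can be illustrated with a small userspace model.
      struct buf and find_block() are hypothetical simplifications, not the
      kernel's buffer_head or its lookup code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a buffer head. */
struct buf {
    bool mapped;                /* models the BH_Mapped state bit */
    unsigned long long blocknr; /* only meaningful when mapped is true */
};

/* Return the index of the buffer mapped to 'block', or -1 if none.
 * Unmapped buffers are skipped before their block number is looked at,
 * so a stale or zero blocknr cannot match a lookup for block zero. */
int find_block(const struct buf *bufs, size_t n, unsigned long long block)
{
    assert(bufs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!bufs[i].mapped)
            continue;
        if (bufs[i].blocknr == block)
            return (int)i;
    }
    return -1;
}
```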
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: guard against jiffies wraparound on inode->dirtied_when checks (try #3) · d2caa3c5
      Jeff Layton authored
      
      The dirtied_when value on an inode is supposed to represent the first time
      that an inode has one of its pages dirtied.  This value is in units of
      jiffies.  It's used in several places in the writeback code to determine
      when to write out an inode.
      
      The problem is that these checks assume that dirtied_when is updated
      periodically.  If an inode is continuously being used for I/O it can be
      persistently marked as dirty and will continue to age.  Once the time
      compared to is greater than or equal to half the maximum of the jiffies
      type, the logic of the time_*() macros inverts and the opposite of what is
      needed is returned.  On 32-bit architectures that's just under 25 days
      (assuming HZ == 1000).
      
      As the least-recently dirtied inode, it'll end up being the first one that
      pdflush will try to write out.  sync_sb_inodes does this check:
      
      	/* Was this inode dirtied after sync_sb_inodes was called? */
       	if (time_after(inode->dirtied_when, start))
       		break;
      
      ...but now dirtied_when appears to be in the future.  sync_sb_inodes bails
      out without attempting to write any dirty inodes.  When this occurs,
      pdflush will stop writing out inodes for this superblock.  Nothing can
      unwedge it until jiffies moves out of the problematic window.
      
      This patch fixes this problem by changing the checks against dirtied_when
      to also check whether it appears to be in the future.  If it does, then we
      consider the value to be far in the past.
      
      This should shrink the problematic window of time to such a small period
      (30s) as not to matter.
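      The inversion and the extra check can be modeled in userspace with a
      32-bit counter.  This is a sketch under assumptions: jiffies_t,
      time_after32(), time_before_eq32() and dirtied_after() are hypothetical
      names that mimic the wraparound-safe comparison style of the time_*()
      macros, and the current time is passed in explicitly.  With HZ == 1000,
      half of the 32-bit range is 2^31 / 1000 seconds, roughly 24.86 days.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t jiffies_t;     /* models a 32-bit jiffies counter */

/* True if a is after b, using a signed difference so that ordinary
 * counter wraparound is handled.  Inverts once a and b are half the
 * counter range or more apart. */
bool time_after32(jiffies_t a, jiffies_t b)
{
    return (int32_t)(b - a) < 0;
}

/* True if a is before or equal to b, with the same caveat. */
bool time_before_eq32(jiffies_t a, jiffies_t b)
{
    return (int32_t)(a - b) <= 0;
}

/* Guarded check: dirtied_when counts as "after t" only if it does not
 * also appear to lie in the future relative to 'now'.  A value that
 * looks like the future is treated as far in the past. */
bool dirtied_after(jiffies_t dirtied_when, jiffies_t t, jiffies_t now)
{
    assert(!time_after32(t, now));  /* t is expected not to be in the future */
    return time_after32(dirtied_when, t) &&
           time_before_eq32(dirtied_when, now);
}
```

      In the stale case the plain comparison reports the inode as dirtied
      after the start time, while the guarded one does not.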
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Ian Kent <raven@themaw.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: skip I_CLEAR state inodes · b6fac63c
      Wu Fengguang authored
      
      clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so
      _outside_ of inode_lock.  So any I_FREEING testing is incomplete without a
      coupled testing of I_CLEAR.
      
      So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
      add_dquot_ref().
      
      Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
      pointed out that the other two cases need fixing too.
      
      Masayoshi MIZUMA has a nice panic flow:
      
      =====================================================================
                  [process A]               |        [process B]
       |                                    |
       |    prune_icache()                  | drop_pagecache()
       |      spin_lock(&inode_lock)        |   drop_pagecache_sb()
       |      inode->i_state |= I_FREEING;  |       |
       |      spin_unlock(&inode_lock)      |       V
       |          |                         |     spin_lock(&inode_lock)
       |          V                         |         |
       |      dispose_list()                |         |
       |        list_del()                  |         |
       |        clear_inode()               |         |
       |          inode->i_state = I_CLEAR  |         |
       |            |                       |         V
       |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
       |            |                       |              continue;           <==== NOT MATCH
       |            |                       |
       |            |                       | (DANGER from here on! Accessing disposing inode!)
       |            |                       |
       |            |                       |      __iget()
       |            |                       |        list_move() <===== PANIC on poisoned list !!
       V            V                       |
      (time)
      =====================================================================
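      The difference between the old and the new state test in the flow above
      can be sketched as follows.  The flag values are illustrative
      assumptions rather than the kernel's definitions, and skip_inode_old()
      and skip_inode_new() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; the real definitions live in the
 * kernel headers and may differ. */
#define I_WILL_FREE 0x10u
#define I_FREEING   0x20u
#define I_CLEAR     0x40u

/* Old test: an inode that has already moved from I_FREEING to I_CLEAR
 * is not skipped. */
bool skip_inode_old(unsigned int state)
{
    return (state & (I_FREEING | I_WILL_FREE)) != 0;
}

/* New test: I_CLEAR is checked together with I_FREEING, so an inode
 * that is being disposed of is skipped in either state. */
bool skip_inode_new(unsigned int state)
{
    return (state & (I_FREEING | I_CLEAR | I_WILL_FREE)) != 0;
}
```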
      Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nommu: fix a number of issues with the per-MM VMA patch · 33e5d769
      David Howells authored
      
      Fix a number of issues with the per-MM VMA patch:
      
       (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
           a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
           system.
      
       (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
           lest it overflow.
      
       (3) Move the allocation of the vm_area_struct slab back to fork.c.
      
       (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.
      
       (5) Use BUG_ON() rather than if () BUG().
      
       (6) Make the default validate_nommu_regions() a static inline rather than a
           #define.
      
       (7) Make free_page_series()'s objection to pages with a refcount != 1 more
           informative.
      
       (8) Adjust the __put_nommu_region() banner comment to indicate that the
           semaphore must be held for writing.
      
       (9) Limit the number of warnings about munmaps of non-mmapped regions.
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 01 Apr, 2009 25 commits
    • NSM: Fix unaligned accesses in nsm_init_private() · ad5b365c
      Mans Rullgard authored
      
      This fixes unaligned accesses in nsm_init_private() when
      creating nlm_reboot keys.
      Signed-off-by: Mans Rullgard <mans@mansr.com>
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • autofs4: fix lookup deadlock · 8f63aaa8
      Ian Kent authored
      
      A deadlock can occur when user space uses a signal (autofs version 4 uses
      SIGCHLD for this) to effect expire completion.
      
      The order of events is:
      
      Expire process completes, but before being able to send SIGCHLD to its parent
      ...

      Another process walks onto a different mount point and drops the directory
      inode semaphore prior to sending the request to the daemon as it must ...

      A third process does an lstat on the expired mount point causing it to wait
      on expire completion (unfortunately) holding the directory semaphore.
      
      The mount request then arrives at the daemon which does an lstat and,
      deadlock.
      
      For some time I was concerned about releasing the directory semaphore around
      the expire wait in autofs4_lookup as well as for the mount call back.  I
      finally realized that the last round of changes in this function made the
      expiring dentry and the lookup dentry separate and distinct so the check and
      possible wait can be done anywhere prior to the mount call back.  This patch
      moves the check to just before the mount call back and inside the directory
      inode mutex release.
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • autofs4: cleanup expire code duplication · 56fcef75
      Ian Kent authored
      
      A significant portion of the autofs_dev_ioctl_expire() and
      autofs4_expire_multi() functions is duplicated code.  This patch cleans that
      up.
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ecryptfs: use kzfree() · 00fcf2cb
      Johannes Weiner authored
      
      Use kzfree() instead of memset() + kfree().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ramfs: add support for "mode=" mount option · c3b1b1cb
      Wu Fengguang authored
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12843
      
      "I use ramfs instead of tmpfs for /tmp because I don't use swap on my
      laptop.  Some apps need 1777 mode for /tmp directory, but ramfs does not
      support 'mode=' mount option."
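      Parsing such an option can be sketched in userspace C as below.  This is
      not the ramfs implementation: parse_mode_option() is a hypothetical
      helper, the comma-separated option format is assumed, and the caller
      supplies the default mode.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan a comma-separated mount option string for
 * "mode=<octal>" and return the permission bits.  If the option is
 * absent or malformed, default_mode is returned unchanged. */
unsigned int parse_mode_option(const char *opts, unsigned int default_mode)
{
    unsigned int mode = default_mode;
    const char *p = opts;

    assert(opts != NULL);
    while (p != NULL && *p != '\0') {
        if (strncmp(p, "mode=", 5) == 0) {
            char *end;
            unsigned long v = strtoul(p + 5, &end, 8);

            /* accept only a non-empty octal number that ends the option */
            if (end != p + 5 && (*end == '\0' || *end == ','))
                mode = (unsigned int)(v & 07777);
        }
        p = strchr(p, ',');     /* advance to the next option, if any */
        if (p != NULL)
            p++;
    }
    return mode;
}
```

      For the /tmp use case quoted above, the option string "mode=1777"
      yields octal 1777.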
      Reported-by: Avan Anishchuk <matimatik@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll keyed wakeups: make eventfd use keyed wakeups · 39510888
      Davide Libenzi authored
      
      Introduce keyed event wakeups inside the eventfd code.
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Cc: William Lee Irwin III <wli@movementarian.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll keyed wakeups: teach epoll about hints coming with the wakeup key · 2dfa4eea
      Davide Libenzi authored
      
      Use the events hint now sent by some devices to avoid unnecessary wakeups
      for events that are of no interest to the caller.  This code handles both
      devices that are sending keyed events and the ones that are not (and
      even the ones that sometimes send events, and sometimes don't).
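      The filtering idea can be sketched as a pure function.  This is an
      assumption-laden model, not the epoll code: wakeup_is_relevant() is a
      hypothetical name, and a key of zero stands for "the sender supplied no
      hint".

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>

/* Hypothetical model of the hint check.  key_events carries the event
 * bits sent with the wakeup, or 0 when the device sends no hint;
 * interest is the event mask the caller registered. */
bool wakeup_is_relevant(unsigned long key_events, unsigned long interest)
{
    if (key_events == 0)
        return true;    /* no hint: cannot filter, so wake up as before */
    return (key_events & interest) != 0;
}
```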
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Cc: William Lee Irwin III <wli@movementarian.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • eventfd: improve support for semaphore-like behavior · bcd0b235
      Davide Libenzi authored
      People started using eventfd in a semaphore-like way where before they
      were using pipes.
      
      That is, counter-based resource access, where a "wait()" returns
      immediately by decrementing the counter by one if the counter is greater
      than zero and otherwise waits, and where a "post(count)" adds count to
      the counter, releasing the appropriate number of waiters.  With eventfd
      the "post" (write) part is fine, while the "wait" (read) does not dequeue
      1, but the whole counter value.
      
      The problem with eventfd is that a read() on the fd returns and wipes the
      whole counter, making the use of it as semaphore a little bit more
      cumbersome.  You can do a read() followed by a write() of COUNTER-1, but
      IMO it's pretty easy and cheap to make this work w/out extra steps.  This
      patch introduces a new eventfd flag that tells eventfd to only dequeue 1
      from the counter, allowing simple read/write to make it behave like a
      semaphore.  Simple test here:
      
      http://www.xmailserver.org/eventfd-sem.c
      
      To be backward compatible with earlier kernels, userspace applications
      should probe for the availability of this feature via
      
      #ifdef EFD_SEMAPHORE
      	fd = eventfd2 (CNT, EFD_SEMAPHORE);
      	if (fd == -1 && errno == EINVAL)
      		<fallback>
      #else
      		<fallback>
      #endif
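      A small sketch of the semaphore-like usage follows, assuming a Linux
      system whose C library provides eventfd() and EFD_SEMAPHORE in
      <sys/eventfd.h>.  The wrapper names efd_sem_create(), efd_sem_try_wait()
      and efd_sem_post() are hypothetical; EFD_NONBLOCK is added so that a
      wait on an empty counter fails with EAGAIN instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a semaphore-style eventfd with the given initial count.
 * Returns the fd, or -1 with errno set (EINVAL if the running kernel
 * does not know EFD_SEMAPHORE). */
int efd_sem_create(unsigned int initial)
{
    return eventfd(initial, EFD_SEMAPHORE | EFD_NONBLOCK);
}

/* Take one unit.  Returns 0 on success, -1 with errno set otherwise
 * (EAGAIN when the counter is zero). */
int efd_sem_try_wait(int fd)
{
    uint64_t value;
    ssize_t n = read(fd, &value, sizeof(value));

    if (n != (ssize_t)sizeof(value))
        return -1;
    assert(value == 1);     /* semaphore mode dequeues exactly 1 per read */
    return 0;
}

/* Add 'count' units.  Returns 0 on success, -1 otherwise. */
int efd_sem_post(int fd, uint64_t count)
{
    ssize_t n = write(fd, &count, sizeof(count));

    return n == (ssize_t)sizeof(count) ? 0 : -1;
}
```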
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: <linux-api@vger.kernel.org>
      Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: use real type instead of void * · 4f0989db
      Tony Battersby authored
      
      eventpoll.c uses void * in one place for no obvious reason; change it to
      use the real type instead.
      Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: clean up ep_modify · e057e15f
      Tony Battersby authored
      
      ep_modify() doesn't need to set event.data from within the ep->lock
      spinlock as the comment suggests.  The only place event.data is used is
      ep_send_events_proc(), and this is protected by ep->mtx instead of
      ep->lock.  Also update the comment for mutex_lock() at the top of
      ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
      epoll_ctl(EPOLL_CTL_MOD).
      
      ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().
      Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: remove unnecessary xchg · d1bc90dd
      Tony Battersby authored
      
      xchg in ep_unregister_pollwait() is unnecessary because it is protected by
      either epmutex or ep->mtx (the same protection as ep_remove()).
      
      If xchg was necessary, it would be insufficient to protect against
      problems: if multiple concurrent calls to ep_unregister_pollwait() were
      possible then a second caller that returns without doing anything because
      nwait == 0 could return before the waitqueues are removed by the first
      caller, which looks like it could lead to problematic races with
      ep_poll_callback().
      
      So remove xchg and add comments about the locking.
      Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: remember the event if epoll_wait returns -EFAULT · d0305882
      Tony Battersby authored
      
      If epoll_wait returns -EFAULT, the event that was being returned when the
      fault was encountered will be forgotten.  This is not a big deal since
      EFAULT will happen only if a buggy userspace program passes in a bad
      address, in which case what happens later usually doesn't matter.
      However, it is easy to remember the event for later, and this patch makes
      a simple change to do that.
      Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: don't use current in irq context · abff55ce
      Tony Battersby authored
      
      ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
      dereferencing it) to detect callback recursion, but it may be called from
      irq context where the use of current is generally discouraged.  It would
      be better to use get_cpu() and put_cpu() to detect the callback recursion.
      Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abff55ce
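A rough userspace model of recursion detection keyed by an execution-context id (a CPU number here, instead of a task pointer) might look like the following. All names (`call_nested`, `nested_call`, `MAX_NESTS`, `reenter_cb`) are hypothetical, the model is single-threaded, and it assumes the context id stays stable for the duration of the call, which in the kernel is what get_cpu()/put_cpu() provide by disabling preemption.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NESTS  4	/* hypothetical nesting limit */
#define MAX_ACTIVE 16	/* hypothetical table size */

/* One in-flight nested call: which object (cookie) on which context. */
struct nested_call {
	const void *cookie;
	int ctx;
	int used;
};

static struct nested_call active[MAX_ACTIVE];

typedef int (*nested_fn)(void *priv, int depth);

/*
 * Run fn unless the same cookie is already active on this context, or
 * the context is already nested more than MAX_NESTS deep.
 * Returns fn's result, or -1 when recursion (or a full table) is detected.
 */
static int call_nested(nested_fn fn, void *priv, const void *cookie, int ctx)
{
	int depth = 0, slot = -1, ret;

	assert(fn != NULL);
	for (int i = 0; i < MAX_ACTIVE; i++) {
		if (!active[i].used) {
			if (slot < 0)
				slot = i;
			continue;
		}
		if (active[i].ctx != ctx)
			continue;	/* calls on other contexts don't count */
		if (active[i].cookie == cookie || ++depth > MAX_NESTS)
			return -1;
	}
	if (slot < 0)
		return -1;
	active[slot] = (struct nested_call){ cookie, ctx, 1 };
	ret = fn(priv, depth);
	active[slot].used = 0;
	return ret;
}

/* Demo callbacks: leaf reports its depth, reenter_cb nests one more call. */
struct reenter {
	const void *cookie;
	int ctx;
};

static int leaf(void *priv, int depth)
{
	(void)priv;
	return depth;
}

static int reenter_cb(void *priv, int depth)
{
	struct reenter *r = priv;

	(void)depth;
	return call_nested(leaf, NULL, r->cookie, r->ctx);
}
```

Re-entering with the same cookie on the same context is rejected, while a different cookie, or the same cookie on another context, is allowed through.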
    • epoll: remove debugging code · bb57c3ed
      Davide Libenzi authored
      
      Remove debugging code from epoll.  There's no need for it to be included
      into mainline code.
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb57c3ed
    • epoll: fix epoll's own poll (update) · 296e236e
      Davide Libenzi authored
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      296e236e
    • epoll: fix epoll's own poll · 5071f97e
      Davide Libenzi authored
      
      Fix a bug inside the epoll's f_op->poll() code, that returns POLLIN even
      though there are no actual ready monitored fds.  The bug shows up if you
      add an epoll fd inside another fd container (poll, select, epoll).
      
      The problem is that the callback-based wakeups used by epoll do not carry
      any information about the events that actually happened (patches to fix
      this will follow).  So the callback code, since it can't call the file*
      ->poll() inside the callback, chains the file* into a ready-list.
      
      So, suppose you added an fd with EPOLLOUT only, and some data shows up on
      the fd, the file* mapped by the fd will be added into the ready-list (via
      wakeup callback).  During normal epoll_wait() use, this condition is
      sorted out at the time we're actually able to call the file*'s
      f_op->poll().
      
      Inside the old epoll's f_op->poll() though, only a quick check
      !list_empty(ready-list) was performed, and this could have led to
      reporting POLLIN even though no ready fds would show up at a following
      epoll_wait().  In order to correctly report the ready status for an epoll
      fd, the ready-list must be checked to see if any really available fd+event
      would be ready in a following epoll_wait().
      
      This operation (calling f_op->poll() from inside f_op->poll()), like
      wakeups, must be handled with care because epoll fds can be added to
      other epoll fds.
      
      Test code:
      
      /*
       *  epoll_test by Davide Libenzi (Simple code to test epoll internals)
       *  Copyright (C) 2008  Davide Libenzi
       *
       *  This program is free software; you can redistribute it and/or modify
       *  it under the terms of the GNU General Public License as published by
       *  the Free Software Foundation; either version 2 of the License, or
       *  (at your option) any later version.
       *
       *  This program is distributed in the hope that it will be useful,
       *  but WITHOUT ANY WARRANTY; without even the implied warranty of
       *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
       *  GNU General Public License for more details.
       *
       *  You should have received a copy of the GNU General Public License
       *  along with this program; if not, write to the Free Software
       *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
       *
       *  Davide Libenzi <davidel@xmailserver.org>
       *
       */
      
      #include <sys/types.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <errno.h>
      #include <signal.h>
      #include <limits.h>
      #include <poll.h>
      #include <sys/epoll.h>
      #include <sys/wait.h>
      
      #define EPWAIT_TIMEO	(1 * 1000)
      #ifndef POLLRDHUP
      #define POLLRDHUP 0x2000
      #endif
      
      #define EPOLL_MAX_CHAIN	100L
      
      #define EPOLL_TF_LOOP (1 << 0)
      
      struct epoll_test_cfg {
      	long size;
      	long flags;
      };
      
      static int xepoll_create(int n) {
      	int epfd;
      
      	if ((epfd = epoll_create(n)) == -1) {
      		perror("epoll_create");
      		exit(2);
      	}
      
      	return epfd;
      }
      
      static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
      	if (epoll_ctl(epfd, cmd, fd, evt) < 0) {
      		perror("epoll_ctl");
      		exit(3);
      	}
      }
      
      static void xpipe(int *fds) {
      	if (pipe(fds)) {
      		perror("pipe");
      		exit(4);
      	}
      }
      
      static pid_t xfork(void) {
      	pid_t pid;
      
      	if ((pid = fork()) == (pid_t) -1) {
      		perror("fork");
      		exit(5);
      	}
      
      	return pid;
      }
      
      static int run_forked_proc(int (*proc)(void *), void *data) {
      	int status;
      	pid_t pid;
      
      	if ((pid = xfork()) == 0)
      		exit((*proc)(data));
      	if (waitpid(pid, &status, 0) != pid) {
      		perror("waitpid");
      		return -1;
      	}
      
      	return WIFEXITED(status) ? WEXITSTATUS(status): -2;
      }
      
      static int check_events(int fd, int timeo) {
      	struct pollfd pfd;
      
      	fprintf(stdout, "Checking events for fd %d\n", fd);
      	memset(&pfd, 0, sizeof(pfd));
      	pfd.fd = fd;
      	pfd.events = POLLIN | POLLOUT;
      	if (poll(&pfd, 1, timeo) < 0) {
      		perror("poll()");
      		return 0;
      	}
      	if (pfd.revents & POLLIN)
      		fprintf(stdout, "\tPOLLIN\n");
      	if (pfd.revents & POLLOUT)
      		fprintf(stdout, "\tPOLLOUT\n");
      	if (pfd.revents & POLLERR)
      		fprintf(stdout, "\tPOLLERR\n");
      	if (pfd.revents & POLLHUP)
      		fprintf(stdout, "\tPOLLHUP\n");
      	if (pfd.revents & POLLRDHUP)
      		fprintf(stdout, "\tPOLLRDHUP\n");
      
      	return pfd.revents;
      }
      
      static int epoll_test_tty(void *data) {
      	int epfd, ifd = fileno(stdin), res;
      	struct epoll_event evt;
      
      	if (check_events(ifd, 0) != POLLOUT) {
      		fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
      		return 1;
      	}
      	epfd = xepoll_create(1);
      	fprintf(stdout, "Created epoll fd (%d)\n", epfd);
      	memset(&evt, 0, sizeof(evt));
      	evt.events = EPOLLIN;
      	xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &evt);
      	if (check_events(epfd, 0) & POLLIN) {
      		res = epoll_wait(epfd, &evt, 1, 0);
      		if (res == 0) {
      			fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
      				epfd);
      			return 2;
      		}
      	}
      
      	return 0;
      }
      
      static int epoll_wakeup_chain(void *data) {
      	struct epoll_test_cfg *tcfg = data;
      	int i, res, epfd, bfd, nfd, pfds[2];
      	pid_t pid;
      	struct epoll_event evt;
      
      	memset(&evt, 0, sizeof(evt));
      	evt.events = EPOLLIN;
      
      	epfd = bfd = xepoll_create(1);
      
      	for (i = 0; i < tcfg->size; i++) {
      		nfd = xepoll_create(1);
      		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
      		bfd = nfd;
      	}
      	xpipe(pfds);
      	if (tcfg->flags & EPOLL_TF_LOOP)
      	{
      		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
      		/*
      		 * If we're testing for a loop, we want the wakeup
      		 * triggered by the child's write to the pipe to
      		 * produce a fake event. So we add the pipe read
      		 * side with EPOLLOUT events. This triggers an
      		 * addition to the ready-list, but no real events
      		 * will be there. Then the epoll kernel code will
      		 * call f_op->poll() of the epfd, triggering the
      		 * loop we want to test.
      		 */
      		evt.events = EPOLLOUT;
      	}
      	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
      
      	/*
      	 * The pipe write must come after the poll(2) call inside
      	 * check_events(). This tests the nested wakeup code in
      	 * fs/eventpoll.c:ep_poll_safewake().
      	 * By having check_events() (hence poll(2)) happen first,
      	 * we have the poll wait queue filled up, and the write(2) in
      	 * the child will trigger the wakeup chain.
      	 */
      	if ((pid = xfork()) == 0) {
      		sleep(1);
      		write(pfds[1], "w", 1);
      		exit(0);
      	}
      
      	res = check_events(epfd, 2000) & POLLIN;
      
      	if (waitpid(pid, NULL, 0) != pid) {
      		perror("waitpid");
      		return -1;
      	}
      
      	return res;
      }
      
      static int epoll_poll_chain(void *data) {
      	struct epoll_test_cfg *tcfg = data;
      	int i, res, epfd, bfd, nfd, pfds[2];
      	pid_t pid;
      	struct epoll_event evt;
      
      	memset(&evt, 0, sizeof(evt));
      	evt.events = EPOLLIN;
      
      	epfd = bfd = xepoll_create(1);
      
      	for (i = 0; i < tcfg->size; i++) {
      		nfd = xepoll_create(1);
      		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
      		bfd = nfd;
      	}
      	xpipe(pfds);
      	if (tcfg->flags & EPOLL_TF_LOOP)
      	{
      		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
      		/*
      		 * If we're testing for a loop, we want the wakeup
      		 * triggered by the child's write to the pipe to
      		 * produce a fake event. So we add the pipe read
      		 * side with EPOLLOUT events. This triggers an
      		 * addition to the ready-list, but no real events
      		 * will be there. Then the epoll kernel code will
      		 * call f_op->poll() of the epfd, triggering the
      		 * loop we want to test.
      		 */
      		evt.events = EPOLLOUT;
      	}
      	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
      
      	/*
      	 * The pipe write must come before the poll(2) call inside
      	 * check_events(). This tests the nested f_op->poll() call code in
      	 * fs/eventpoll.c:ep_eventpoll_poll().
      	 * By having the pipe write(2) happen first, we make the kernel
      	 * epoll code load the ready lists, and the following poll(2)
      	 * done inside check_events() will test the nested poll code in
      	 * ep_eventpoll_poll().
      	 */
      	if ((pid = xfork()) == 0) {
      		write(pfds[1], "w", 1);
      		exit(0);
      	}
      	sleep(1);
      	res = check_events(epfd, 1000) & POLLIN;
      
      	if (waitpid(pid, NULL, 0) != pid) {
      		perror("waitpid");
      		return -1;
      	}
      
      	return res;
      }
      
      int main(int ac, char **av) {
      	int error;
      	struct epoll_test_cfg tcfg;
      
      	fprintf(stdout, "\n********** Testing TTY events\n");
      	error = run_forked_proc(epoll_test_tty, NULL);
      	fprintf(stdout, error == 0 ?
      		"********** OK\n": "********** FAIL (%d)\n", error);
      
      	tcfg.size = 3;
      	tcfg.flags = 0;
      	fprintf(stdout, "\n********** Testing short wakeup chain\n");
      	error = run_forked_proc(epoll_wakeup_chain, &tcfg);
      	fprintf(stdout, error == POLLIN ?
      		"********** OK\n": "********** FAIL (%d)\n", error);
      
      	tcfg.size = EPOLL_MAX_CHAIN;
      	tcfg.flags = 0;
      	fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
      	error = run_forked_proc(epoll_wakeup_chain, &tcfg);
      	fprintf(stdout, error == 0 ?
      		"********** OK\n": "********** FAIL (%d)\n", error);
      
      	tcfg.size = 3;
      	tcfg.flags = 0;
      	fprintf(stdout, "\n********** Testing short poll chain\n");
      	error = run_forked_proc(epoll_poll_chain, &tcfg);
      	fprintf(stdout, error == POLLIN ?
      		"********** OK\n": "********** FAIL (%d)\n", error);
      
      	tcfg.size = EPOLL_MAX_CHAIN;
      	tcfg.flags = 0;
      	fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
      	error = run_forked_proc(epoll_poll_chain, &tcfg);
      	fprintf(stdout, error == 0 ?
      		"********** OK\n": "********** FAIL (%d)\n", error);
      
      	tcfg.size = 3;
      	tcfg.flags = EPOLL_TF_LOOP;
      	fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
      	error = run_forked_proc(epoll_wakeup_chain, &tcfg);
      	fprintf(stdout, error == 0 ?
      		"********** OK\n": "********** FAIL (%d)\n", error);
      
      	tcfg.size = 3;
      	tcfg.flags = EPOLL_TF_LOOP;
      	fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
      	error = run_forked_proc(epoll_poll_chain, &tcfg);
      	fprintf(stdout, error == 0 ?
      		"********** OK\n": "********** FAIL (%d)\n", error);
      
      	return 0;
      }
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5071f97e
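The difference between the old quick check and a scan of the ready-list can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel implementation: `struct item`, `poll_quick()`, `poll_scan()` and the `EV_*` bit values are hypothetical, and each item simply records which events were requested and which the fd currently reports.

```c
#include <assert.h>
#include <stddef.h>

#define EV_IN  0x1u	/* placeholder bit for "readable" */
#define EV_OUT 0x4u	/* placeholder bit for "writable" */

/* One entry on the (modelled) ready-list. */
struct item {
	unsigned int wanted;	/* events the user registered for */
	unsigned int now;	/* events the fd reports right now */
};

/* Old behaviour: a non-empty ready-list is reported as readable. */
static unsigned int poll_quick(const struct item *ready, size_t n)
{
	(void)ready;
	return n > 0 ? EV_IN : 0;
}

/*
 * Fixed behaviour: report readable only if some entry has at least
 * one requested event actually pending.
 */
static unsigned int poll_scan(const struct item *ready, size_t n)
{
	assert(ready != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (ready[i].now & ready[i].wanted)
			return EV_IN;
	return 0;
}
```

An fd registered for output only that becomes readable ends up on the ready-list; the quick check reports the container as readable, while the scan does not.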
    • ntfs: remove private wrapper of endian helpers · 63cd8854
      Harvey Harrison authored
      
      The base versions handle constant folding now and are shorter than these
      private wrappers, use them directly.
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63cd8854
    • filesystem freeze: allow SysRq emergency thaw to thaw frozen filesystems · c2d75438
      Eric Sandeen authored
      
      Now that the filesystem freeze operation has been elevated to the VFS, and
      is just an ioctl away, some sort of safety net for unintentionally frozen
      root filesystems may be in order.
      
      The timeout thaw originally proposed did not get merged, but perhaps
      something like this would be useful in emergencies.
      
      For example, freeze /path/to/mountpoint may freeze your root filesystem if
      you forgot that you had that unmounted.
      
      I chose 'j' as the last remaining character other than 'h' which is sort
      of reserved for help (because help is generated on any unknown character).
      
      I've tested this on a non-root fs with multiple (nested) freezers, as well
      as on a system rendered unresponsive due to a frozen root fs.
      
      [randy.dunlap@oracle.com: emergency thaw only if CONFIG_BLOCK enabled]
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Cc: Takashi Sato <t-sato@yk.jp.nec.com>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2d75438
    • vmscan: fix it to take care of nodemask · 327c0e96
      KAMEZAWA Hiroyuki authored
      
      try_to_free_pages() is used for the direct reclaim of up to
      SWAP_CLUSTER_MAX pages when watermarks are low.  The caller to
      alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
      be used but this is not passed to try_to_free_pages().  This can lead to
      unnecessary reclaim of pages that are unusable by the caller and, in the
      worst case, to allocation failure because no progress is made where it
      is needed.
      
      This patch passes the nodemask used for alloc_pages_nodemask() to
      try_to_free_pages().
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      327c0e96
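Why the nodemask matters for direct reclaim can be shown with a toy model in which each node has some number of reclaimable pages and reclaim only visits nodes in the caller's mask. This is a sketch under stated assumptions: `reclaim_from_nodes()`, `NR_NODES` and the plain bitmask representation are hypothetical and do not mirror the kernel's nodemask_t or zonelist walking.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4	/* hypothetical number of NUMA nodes */

/*
 * Reclaim up to `want` pages, visiting only nodes whose bit is set in
 * `nodemask`. Returns the number of pages freed on allowed nodes.
 */
static unsigned long reclaim_from_nodes(unsigned long reclaimable[NR_NODES],
					unsigned int nodemask,
					unsigned long want)
{
	unsigned long freed = 0;

	assert(reclaimable != NULL);
	for (unsigned int node = 0; node < NR_NODES && freed < want; node++) {
		unsigned long take;

		if (!(nodemask & (1u << node)))
			continue;	/* caller cannot use this node */
		take = reclaimable[node];
		if (take > want - freed)
			take = want - freed;
		reclaimable[node] -= take;
		freed += take;
	}
	return freed;
}
```

With a mask of all nodes, the model satisfies the request from node 0 even when the caller can only allocate from node 2; with the caller's mask, only node 2 is touched and the freed pages are ones the caller can actually use.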
    • ramfs-nommu: use generic lru cache · 2678958e
      Johannes Weiner authored
      
      Instead of open-coding the lru-list-add pagevec batching when expanding a
      file mapping from zero, defer to the appropriate page cache function that
      also takes care of adding the page to the lru list.
      
      This is cleaner, saves code and reduces the stack footprint by 16 words
      worth of pagevec.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.com>
      Cc: MinChan Kim <minchan.kim@gmail.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2678958e
    • mm: page_mkwrite change prototype to match fault: fix sysfs · 851a039c
      Hugh Dickins authored
      
      Fix warnings and return values in sysfs bin_page_mkwrite(), fixing
      fs/sysfs/bin.c: In function `bin_page_mkwrite':
      fs/sysfs/bin.c:250: warning: passing argument 2 of `bb->vm_ops->page_mkwrite' from incompatible pointer type
      fs/sysfs/bin.c: At top level:
      fs/sysfs/bin.c:280: warning: initialization from incompatible pointer type
      
      Expects to have my [PATCH next] sysfs: fix some bin_vm_ops errors
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: "Eric W. Biederman" <ebiederm@aristanetworks.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      851a039c
    • fs: fix page_mkwrite error cases in core code and btrfs · 56a76f82
      Nick Piggin authored
      
      page_mkwrite is called with neither the page lock nor the ptl held.  This
      means a page can be concurrently truncated or invalidated out from
      underneath it.  Callers are supposed to prevent truncate races themselves,
      but previously the only thing they could do when they hit one was to
      raise a SIGBUS.  A SIGBUS is wrong when the page has been
      invalidated or truncated within i_size (e.g. hole punched).  Callers may
      also have to perform memory allocations in this path, where again, SIGBUS
      would be wrong.
      
      The previous patch ("mm: page_mkwrite change prototype to match fault")
      made it possible to properly specify errors.  Convert the generic buffer.c
      code and btrfs to return sane error values (in the case of page removed
      from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
      without doing anything, and the fault will be retried properly).
      
      This fixes core code, and converts btrfs as a template/example.  All other
      filesystems defining their own page_mkwrite should be fixed in a similar
      manner.
      Acked-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56a76f82
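The error mapping described in this message can be sketched as a small decision function. This is an illustrative model only: the `fault_result` enum and `mkwrite_result()` are hypothetical stand-ins for the kernel's VM_FAULT_xxx flags, and the numeric values are placeholders.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder outcomes standing in for VM_FAULT_xxx flags. */
enum fault_result {
	FAULT_OK = 0,	/* page is ready to be made writable */
	FAULT_NOPAGE,	/* do nothing; the fault will be retried */
	FAULT_OOM,	/* memory allocation failed */
	FAULT_SIGBUS	/* genuine error such as -ENOSPC or -EIO */
};

/*
 * page_in_cache: nonzero if the page is still in the pagecache
 *                (not truncated or invalidated under us).
 * err:           0 or a negative errno from preparing the write.
 */
static enum fault_result mkwrite_result(int page_in_cache, int err)
{
	assert(err <= 0);
	if (!page_in_cache)
		return FAULT_NOPAGE;	/* lost a race: retry, no SIGBUS */
	if (err == 0)
		return FAULT_OK;
	if (err == -ENOMEM)
		return FAULT_OOM;
	return FAULT_SIGBUS;
}
```

The point of the model is that a page removed from the pagecache and an allocation failure each get their own outcome instead of being folded into SIGBUS.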
    • mm: page_mkwrite change prototype to match fault · c2ec175c
      Nick Piggin authored
      
      Change the page_mkwrite prototype to take a struct vm_fault, and return
      VM_FAULT_xxx flags.  There should be no functional change.
      
      This makes it possible to return much more detailed error information to
      the VM (and also can provide more information eg.  virtual_address to the
      driver, which might be important in some special cases).
      
      This is required for a subsequent fix, and will also make it easier to
      merge page_mkwrite() with fault() in future.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Cc: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2ec175c
    • mm: reintroduce and deprecate rlimit based access for SHM_HUGETLB · 2584e517
      Ravikiran G Thirumalai authored
      
      Allow non root users with sufficient mlock rlimits to be able to allocate
      hugetlb backed shm for now, but deprecate this path.  It is being
      deprecated because the mlock based rlimit checks for SHM_HUGETLB are not
      consistent with mmap based huge page allocations.
      Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2584e517
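One way to picture the resulting policy is a three-way decision: privileged users are allowed outright, unprivileged users whose mlock rlimit covers the request are allowed through a deprecated path, and everyone else is denied. The sketch below is a userspace model; `shm_request`, `shm_decision` and `hugetlb_shm_decide()` are hypothetical names and the `privileged` flag stands for "has CAP_IPC_LOCK or belongs to hugetlb_shm_group".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum shm_decision {
	SHM_ALLOW,		/* capability or group membership */
	SHM_ALLOW_DEPRECATED,	/* rlimit-based access, to be removed */
	SHM_DENY
};

/* Hypothetical description of one SHM_HUGETLB request. */
struct shm_request {
	bool privileged;		/* CAP_IPC_LOCK or in hugetlb_shm_group */
	unsigned long size;		/* bytes requested */
	unsigned long memlock_limit;	/* caller's mlock rlimit in bytes */
};

static enum shm_decision hugetlb_shm_decide(const struct shm_request *req)
{
	assert(req != NULL);
	if (req->privileged)
		return SHM_ALLOW;	/* no rlimit check at all */
	if (req->size <= req->memlock_limit)
		return SHM_ALLOW_DEPRECATED;
	return SHM_DENY;
}
```

A real implementation would also have to account for the locked memory and warn about the deprecated path; the model leaves both out.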
    • mm: fix SHM_HUGETLB to work with users in hugetlb_shm_group · 8a0bdec1
      Ravikiran G Thirumalai authored
      
      Fix hugetlb subsystem so that non root users belonging to
      hugetlb_shm_group can actually allocate hugetlb backed shm.
      
      Currently non root users cannot even map one large page using SHM_HUGETLB
      when they belong to the gid in /proc/sys/vm/hugetlb_shm_group.  This is
      because allocation size is verified against RLIMIT_MEMLOCK resource limit
      even if the user belongs to hugetlb_shm_group.
      
      This patch
      1. Fixes hugetlb subsystem so that users with CAP_IPC_LOCK and users
         belonging to hugetlb_shm_group don't need to be restricted with
         RLIMIT_MEMLOCK resource limits
      2. This patch also disables mlock based rlimit checking (which will
         be reinstated and marked deprecated in a subsequent patch).
      Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a0bdec1
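The permission rule from point 1 can be modelled as a simple predicate: the capability or membership in the configured group is sufficient, and the requested size is not compared against RLIMIT_MEMLOCK. This is a userspace sketch; `struct creds` and `may_use_hugetlb_shm()` are hypothetical names, and the group list is a plain array rather than kernel credentials.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the calling user's credentials. */
struct creds {
	bool cap_ipc_lock;		/* holds CAP_IPC_LOCK */
	const unsigned int *groups;	/* gids the user belongs to */
	size_t ngroups;
};

/*
 * True if the user may create hugetlb backed shm: either the
 * capability or membership in hugetlb_shm_group is enough.
 */
static bool may_use_hugetlb_shm(const struct creds *c,
				unsigned int hugetlb_shm_group)
{
	assert(c != NULL);
	if (c->cap_ipc_lock)
		return true;
	for (size_t i = 0; i < c->ngroups; i++)
		if (c->groups[i] == hugetlb_shm_group)
			return true;
	return false;
}
```

Note that no size or rlimit argument appears in the predicate, which is the behavioural change this message describes for group members.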