    Quicklists for page table pages · 6225e937
    Christoph Lameter authored
    
    On x86_64 this cuts allocation overhead for page table pages down to a
    fraction (kernel compile / editing load; TSC-based measurement of the time
    spent in each function):
    
    no quicklist:
    
    pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
    pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
    pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
    pgd_alloc                260022 1s(920ns/4us/263.1us)
    
    quicklist:
    
    pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
    pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
    pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
    pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)
    
    pgd allocations are the most complex, and there we see the most dramatic
    improvement (maybe we can cut down the number of pgds cached somewhat?).  But
    even the pte allocations still see a doubling of performance.
    
    1. Proven code from the IA64 arch.
    
    	The method used here has been fine-tuned for years and
    	is NUMA aware. It is based on the knowledge that accesses
    	to page table pages are sparse in nature. Taking a page
    	off the freelists instead of allocating a zeroed page
    	reduces the number of cachelines touched
    	in addition to getting rid of the slab overhead. So
    	performance improves. This is particularly useful if pgds
    	contain standard mappings. We can save on the teardown
    	and setup of such a page if we have some on the quicklists.
    	This includes avoiding list operations that are otherwise
    	necessary on alloc and free to track pgds.
    
    2. Lightweight alternative to using slab to manage page-sized pages
    
    	Slab overhead is significant, and even page allocator use
    	is pretty heavyweight. The use of a per-cpu quicklist
    	means that we touch only two cachelines for an allocation.
    	There is no need to access the struct page (unless arch code
    	needs to fiddle around with it). So the fast path just
    	means bringing in one cacheline at the beginning of the
    	page. That same cacheline may then be used to store the
    	page table entry. Or a second cacheline may be used
    	if the page table entry is not in the first cacheline of
    	the page. The current code will zero the page, which means
    	touching 32 cachelines (assuming 128-byte cachelines). We get down
    	from 32 to 2 cachelines in the fast path.
    
    3. x86_64 gets lightweight page table page management.
    
    	This will allow x86_64 arch code to repopulate pgds
    	and other page table entries faster. The list operations for pgds
    	are reduced, in the same way as for i386, to the points where
    	a pgd is allocated from the page allocator and where it is
    	freed back to the page allocator. A pgd can pass through
    	the quicklists without having to be reinitialized.
    
    4. Consolidation of code from multiple arches
    
    	So far arches have had their own implementations of quicklist
    	management. This patch moves that feature into the core, allowing
    	easier maintenance and consistent management of quicklists.
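
The cacheline arithmetic in item 2 can be checked with a few lines of C. This is only a sketch under the sizes the text assumes (4 KiB pages, 128-byte cachelines); the names `PAGE_BYTES`, `CACHELINE_BYTES` and both helper functions are made up for illustration and are not part of the patch. It counts only cachelines inside the page itself.

```c
#include <stddef.h>

/* Assumed sizes, taken from the text: 4 KiB pages, 128-byte cachelines. */
enum { PAGE_BYTES = 4096, CACHELINE_BYTES = 128 };

/* Zeroing a page writes every cacheline in it. */
static size_t cachelines_to_zero_page(void)
{
    return PAGE_BYTES / CACHELINE_BYTES;
}

/*
 * Quicklist fast path, counted within the page: the cacheline at the
 * start of the page is always touched, and a second one is touched only
 * if the new page table entry lives in a different cacheline.
 */
static size_t cachelines_fast_path(size_t entry_offset)
{
    size_t first_line = 0;
    size_t entry_line = entry_offset / CACHELINE_BYTES;

    return entry_line == first_line ? 1 : 2;
}
```

With these sizes, zeroing touches 32 cachelines, while the fast path touches one or two depending on where the entry falls in the page.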
    
    Page table pages have the characteristic that they are typically zero or in a
    known state when they are freed.  This is usually exactly the same state as
    needed after allocation.  So it makes sense to build a list of freed page
    table pages and then consume those previously used pages first.  Those pages
    have already been initialized correctly (thus no need to zero them) and are
    likely already cached in such a way that the MMU can use them most
    effectively.  Page table pages are used in a sparse way, so zeroing them on
    allocation is not too useful.
    
    Such an implementation already exists for ia64.  However, that implementation
    did not support constructors and destructors as needed by i386 / x86_64.  It
    also only supported a single quicklist.  The implementation here has
    constructor and destructor support as well as the ability for an arch to
    specify how many quicklists are needed.
    
    Quicklists are enabled by an arch defining CONFIG_QUICKLIST.  If more than one
    quicklist is necessary then we can define NR_QUICK for additional lists.  For
    example, i386 needs two and thus has
    
    config NR_QUICK
    	int
    	default 2
    
    If an arch has requested quicklist support then pages can be allocated
    from the quicklist (or from the page allocator if the quicklist is
    empty) via:
    
    quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)
    
    Page table pages can be freed using:
    
    quicklist_free(<quicklist-nr>, <destructor>, <page>)
    
    Pages must have a definite state after allocation and before
    they are freed. If no constructor is specified then pages
    will be zeroed on allocation and must be zeroed before they are
    freed.
    
    If a constructor is used then the constructor will establish
    a definite page state. For example, the i386 and x86_64 pgd constructors
    establish certain mappings.
    
    Constructors and destructors can also be used to track the pages.
    i386 and x86_64 use a list of pgds in order to be able to dynamically
    update standard mappings.
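
The alloc/free contract described above can be modelled in user space. The following is a minimal sketch, not the kernel code: every `model_` name, `tracked_pgds`, and the two example callbacks are hypothetical, `aligned_alloc` stands in for the page allocator, there is a single list set instead of one per CPU, and gfp flags are left out. One further assumption of the sketch is that the destructor runs only when a page actually leaves the list (here in a separate drain step), whereas the interface described above passes the destructor to the free call.

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096
#define MODEL_NR_QUICK  2   /* hypothetical: two lists, as the text says i386 needs */

/* One list per quicklist number; the text describes these as per CPU. */
struct model_quicklist {
    void *head;     /* most recently freed page, or NULL when empty */
    int nr_pages;   /* pages currently cached on this list */
};

static struct model_quicklist model_lists[MODEL_NR_QUICK];

/* Allocate a page: reuse a cached one if possible, else build a fresh one. */
static void *model_quicklist_alloc(int nr, void (*ctor)(void *))
{
    struct model_quicklist *q = &model_lists[nr];
    void **page = q->head;

    if (page) {
        /* Fast path: unlink, then clear the word that held the link. */
        q->head = page[0];
        page[0] = NULL;
        q->nr_pages--;
        return page;
    }

    /* Slow path: stand-in for the page allocator returning a zeroed page. */
    page = aligned_alloc(MODEL_PAGE_SIZE, MODEL_PAGE_SIZE);
    if (!page)
        return NULL;
    memset(page, 0, MODEL_PAGE_SIZE);
    if (ctor)
        ctor(page);     /* runs only for pages that are new to the list */
    return page;
}

/* Free a page that is back in its definite state: cache it, do not tear it down. */
static void model_quicklist_free(int nr, void *page)
{
    struct model_quicklist *q = &model_lists[nr];

    *(void **)page = q->head;   /* first word of the page becomes the link */
    q->head = page;
    q->nr_pages++;
}

/* Hand every cached page back to the allocator, running dtor on each. */
static void model_quicklist_drain(int nr, void (*dtor)(void *))
{
    struct model_quicklist *q = &model_lists[nr];

    while (q->head) {
        void **page = q->head;

        q->head = page[0];
        page[0] = NULL;
        q->nr_pages--;
        if (dtor)
            dtor(page);
        free(page);
    }
}

/* Hypothetical constructor/destructor pair that also tracks live pages. */
static int tracked_pgds;

static void pgd_ctor_example(void *page)
{
    ((unsigned long *)page)[511] = 0xabcUL;  /* stand-in for a standard mapping */
    tracked_pgds++;
}

static void pgd_dtor_example(void *page)
{
    (void)page;
    tracked_pgds--;
}
```

In this model a page that is freed and then allocated again comes back with its constructed contents intact and without a second constructor call, which is the saving the text attributes to pgds passing through the quicklists. Because the first word of the page doubles as the list link and is cleared on reuse, the constructed state must not depend on that word.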
    
    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Andi Kleen <ak@suse.de>
    Cc: "Luck, Tony" <tony.luck@intel.com>
    Cc: William Lee Irwin III <wli@holomorphy.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kconfig 5.1 KB
config SELECT_MEMORY_MODEL
	def_bool y
	depends on EXPERIMENTAL || ARCH_SELECT_MEMORY_MODEL

choice
	prompt "Memory model"
	depends on SELECT_MEMORY_MODEL
	default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT
	default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT
	default FLATMEM_MANUAL

config FLATMEM_MANUAL
	bool "Flat Memory"
	depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || ARCH_FLATMEM_ENABLE
	help
	  This option allows you to change some of the ways that
	  Linux manages its memory internally.  Most users will
	  only have one option here: FLATMEM.  This is normal
	  and a correct option.

	  Some users of more advanced features like NUMA and
	  memory hotplug may have different options here.
	  DISCONTIGMEM is a more mature, better tested system,
	  but is incompatible with memory hotplug and may suffer
	  decreased performance over SPARSEMEM.  If unsure between
	  "Sparse Memory" and "Discontiguous Memory", choose
	  "Discontiguous Memory".

	  If unsure, choose this option (Flat Memory) over any other.

config DISCONTIGMEM_MANUAL
	bool "Discontiguous Memory"
	depends on ARCH_DISCONTIGMEM_ENABLE
	help
	  This option provides enhanced support for discontiguous
	  memory systems, over FLATMEM.  These systems have holes
	  in their physical address spaces, and this option provides
	  more efficient handling of these holes.  However, the vast
	  majority of hardware has quite flat address spaces, and
	  can have degraded performance from extra overhead that
	  this option imposes.

	  Many NUMA configurations will have this as the only option.

	  If unsure, choose "Flat Memory" over this option.

config SPARSEMEM_MANUAL
	bool "Sparse Memory"
	depends on ARCH_SPARSEMEM_ENABLE
	help
	  This will be the only option for some systems, including
	  memory hotplug systems.  This is normal.

	  For many other systems, this will be an alternative to
	  "Discontiguous Memory".  This option provides some potential
	  performance benefits, along with decreased code complexity,
	  but it is newer, and more experimental.

	  If unsure, choose "Discontiguous Memory" or "Flat Memory"
	  over this option.

endchoice

config DISCONTIGMEM
	def_bool y
	depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || DISCONTIGMEM_MANUAL

config SPARSEMEM
	def_bool y
	depends on SPARSEMEM_MANUAL

config FLATMEM
	def_bool y
	depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL

config FLAT_NODE_MEM_MAP
	def_bool y
	depends on !SPARSEMEM

#
# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's
# to represent different areas of memory.  This variable allows
# those dependencies to exist individually.
#
config NEED_MULTIPLE_NODES
	def_bool y
	depends on DISCONTIGMEM || NUMA

config HAVE_MEMORY_PRESENT
	def_bool y
	depends on ARCH_HAVE_MEMORY_PRESENT || SPARSEMEM

#
# SPARSEMEM_EXTREME (which is the default) does some bootmem
# allocations when memory_present() is called.  If this cannot
# be done on your architecture, select this option.  However,
# statically allocating the mem_section[] array can potentially
# consume vast quantities of .bss, so be careful.
#
# This option will also potentially produce smaller runtime code
# with gcc 3.4 and later.
#
config SPARSEMEM_STATIC
	def_bool n

#
# Architecture platforms which require a two level mem_section in SPARSEMEM
# must select this option. This is usually for architecture platforms with
# an extremely sparse physical address space.
#
config SPARSEMEM_EXTREME
	def_bool y
	depends on SPARSEMEM && !SPARSEMEM_STATIC

# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
	bool "Allow for memory hot-add"
	depends on SPARSEMEM || X86_64_ACPI_NUMA
	depends on HOTPLUG && !SOFTWARE_SUSPEND && ARCH_ENABLE_MEMORY_HOTPLUG
	depends on (IA64 || X86 || PPC64)

comment "Memory hotplug is currently incompatible with Software Suspend"
	depends on SPARSEMEM && HOTPLUG && SOFTWARE_SUSPEND

config MEMORY_HOTPLUG_SPARSE
	def_bool y
	depends on SPARSEMEM && MEMORY_HOTPLUG

# Heavily threaded applications may benefit from splitting the mm-wide
# page_table_lock, so that faults on different parts of the user address
# space can be handled with less contention: split it at this NR_CPUS.
# Default to 4 for wider testing, though 8 might be more appropriate.
# ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
# PA-RISC 7xxx's spinlock_t would enlarge struct page from 32 to 44 bytes.
#
config SPLIT_PTLOCK_CPUS
	int
	default "4096" if ARM && !CPU_CACHE_VIPT
	default "4096" if PARISC && !PA20
	default "4"

#
# support for page migration
#
config MIGRATION
	bool "Page migration"
	def_bool y
	depends on NUMA
	help
	  Allows the migration of the physical location of pages of processes
	  while the virtual addresses are not changed. This is useful for
	  example on NUMA systems to put pages nearer to the processors accessing
	  the page.

config RESOURCES_64BIT
	bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL)
	default 64BIT
	help
	  This option allows memory and IO resources to be 64 bit.

config ZONE_DMA_FLAG
	int
	default "0" if !ZONE_DMA
	default "1"

config NR_QUICK
	int
	depends on QUICKLIST
	default "1"