				1	Hugetlbfs Reservation Overview
				2	------------------------------
				3	Huge pages as described at 'Documentation/vm/hugetlbpage.txt' are typically
				4	preallocated for application use. These huge pages are instantiated in a
				5	task's address space at page fault time if the VMA indicates huge pages are
				6	to be used. If no huge page exists at page fault time, the task is sent
				7	a SIGBUS and often dies an unhappy death. Shortly after huge page support
				8	was added, it was determined that it would be better to detect a shortage
				9	of huge pages at mmap() time. The idea is that if there were not enough
				10	huge pages to cover the mapping, the mmap() would fail. This was first
				11	done with a simple check in the code at mmap() time to determine if there
				12	were enough free huge pages to cover the mapping. Like most things in the
				13	kernel, the code has evolved over time. However, the basic idea was to
				14	'reserve' huge pages at mmap() time to ensure that huge pages would be
				15	available for page faults in that mapping. The description below attempts to
				16	describe how huge page reserve processing is done in the v4.10 kernel.
				17
				18
				19	Audience
				20	--------
				21	This description is primarily targeted at kernel developers who are modifying
				22	hugetlbfs code.
				23
				24
				25	The Data Structures
				26	-------------------
				27	resv_huge_pages
				28	This is a global (per-hstate) count of reserved huge pages. Reserved
				29	huge pages are only available to the task which reserved them.
				30	Therefore, the number of huge pages generally available is computed
				31	as (free_huge_pages - resv_huge_pages).
				32	Reserve Map
				33	A reserve map is described by the structure:
				34	struct resv_map {
				35	struct kref refs;
				36	spinlock_t lock;
				37	struct list_head regions;
				38	long adds_in_progress;
				39	struct list_head region_cache;
				40	long region_cache_count;
				41	};
				42	There is one reserve map for each huge page mapping in the system.
				43	The regions list within the resv_map describes the regions within
				44	the mapping. A region is described as:
				45	struct file_region {
				46	struct list_head link;
				47	long from;
				48	long to;
				49	};
				50	The 'from' and 'to' fields of the file region structure are huge page
				51	indices into the mapping. Depending on the type of mapping, a
				52	region in the resv_map may indicate reservations exist for the
				53	range, or reservations do not exist.
				54	Flags for MAP_PRIVATE Reservations
				55	These are stored in the bottom bits of the reservation map pointer.
				56	#define HPAGE_RESV_OWNER (1UL << 0) Indicates this task is the
				57	owner of the reservations associated with the mapping.
				58	#define HPAGE_RESV_UNMAPPED (1UL << 1) Indicates task originally
				59	mapping this range (and creating reserves) has unmapped a
				60	page from this task (the child) due to a failed COW.
				61	Page Flags
				62	The PagePrivate page flag is used to indicate that a huge page
				63	reservation must be restored when the huge page is freed. More
				64	details will be discussed in the "Freeing Huge Pages" section.
				65
				66
				67	Reservation Map Location (Private or Shared)
				68	--------------------------------------------
				69	A huge page mapping or segment is either private or shared. If private,
				70	it is typically only available to a single address space (task). If shared,
				71	it can be mapped into multiple address spaces (tasks). The location and
				72	semantics of the reservation map are significantly different for the two types
				73	of mappings. Location differences are:
				74	- For private mappings, the reservation map hangs off the VMA structure.
				75	Specifically, vma->vm_private_data. This reserve map is created at the
				76	time the mapping (mmap(MAP_PRIVATE)) is created.
				77	- For shared mappings, the reservation map hangs off the inode. Specifically,
				78	inode->i_mapping->private_data. Since shared mappings are always backed
				79	by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
				80	contains a reservation map. As a result, the reservation map is allocated
				81	when the inode is created.
				82
				83
				84	Creating Reservations
				85	---------------------
				86	Reservations are created when a huge page backed shared memory segment is
				87	created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
				88	These operations result in a call to the routine hugetlb_reserve_pages()
				89
				90	int hugetlb_reserve_pages(struct inode *inode,
				91	long from, long to,
				92	struct vm_area_struct *vma,
				93	vm_flags_t vm_flags)
				94
				95	The first thing hugetlb_reserve_pages() does is check whether the NORESERVE
				96	flag was specified in either the shmget() or mmap() call. If NORESERVE
				97	was specified, then this routine returns immediately as no reservations
				98	are desired.
				99
				100	The arguments 'from' and 'to' are huge page indices into the mapping or
				101	underlying file. For shmget(), 'from' is always 0 and 'to' corresponds to
				102	the length of the segment/mapping. For mmap(), the offset argument could
				103	be used to specify the offset into the underlying file. In such a case
				104	the 'from' and 'to' arguments have been adjusted by this offset.
				105
				106	One of the big differences between PRIVATE and SHARED mappings is the way
				107	in which reservations are represented in the reservation map.
				108	- For shared mappings, an entry in the reservation map indicates a reservation
				109	exists or did exist for the corresponding page. As reservations are
				110	consumed, the reservation map is not modified.
				111	- For private mappings, the lack of an entry in the reservation map indicates
				112	a reservation exists for the corresponding page. As reservations are
				113	consumed, entries are added to the reservation map. Therefore, the
				114	reservation map can also be used to determine which reservations have
				115	been consumed.
				116
				117	For private mappings, hugetlb_reserve_pages() creates the reservation map and
				118	hangs it off the VMA structure. In addition, the HPAGE_RESV_OWNER flag is set
				119	to indicate this VMA owns the reservations.
				120
				121	The reservation map is consulted to determine how many huge page reservations
				122	are needed for the current mapping/segment. For private mappings, this is
				123	always the value (to - from). However, for shared mappings it is possible that some reservations may already exist within the range (to - from). See the
				124	section "Reservation Map Modifications" for details on how this is accomplished.
				125
				126	The mapping may be associated with a subpool. If so, the subpool is consulted
				127	to ensure there is sufficient space for the mapping. It is possible that the
				128	subpool has set aside reservations that can be used for the mapping. See the
				129	section "Subpool Reservations" for more details.
				130
				131	After consulting the reservation map and subpool, the number of needed new
				132	reservations is known. The routine hugetlb_acct_memory() is called to check
				133	for and take the requested number of reservations. hugetlb_acct_memory()
				134	calls into routines that potentially allocate and adjust surplus page counts.
				135	However, within those routines the code is simply checking to ensure there
				136	are enough free huge pages to accommodate the reservation. If there are,
				137	the global reservation count resv_huge_pages is adjusted something like the
				138	following.
				139	if (resv_needed <= (free_huge_pages - resv_huge_pages))
				140	resv_huge_pages += resv_needed;
				141	Note that the global lock hugetlb_lock is held when checking and adjusting
				142	these counters.
				143
				144	If there were enough free huge pages and the global count resv_huge_pages
				145	was adjusted, then the reservation map associated with the mapping is
				146	modified to reflect the reservations. In the case of a shared mapping, a
				147	file_region will exist that includes the range 'from' - 'to'. For private
				148	mappings, no modifications are made to the reservation map as lack of an
				149	entry indicates a reservation exists.
				150
				151	If hugetlb_reserve_pages() was successful, the global reservation count and
				152	reservation map associated with the mapping will be modified as required to
				153	ensure reservations exist for the range 'from' - 'to'.
				154
				155
				156	Consuming Reservations/Allocating a Huge Page
				157	---------------------------------------------
				158	Reservations are consumed when huge pages associated with the reservations
				159	are allocated and instantiated in the corresponding mapping. The allocation
				160	is performed within the routine alloc_huge_page().
				161	struct page *alloc_huge_page(struct vm_area_struct *vma,
				162	unsigned long addr, int avoid_reserve)
				163	alloc_huge_page is passed a VMA pointer and a virtual address, so it can
				164	consult the reservation map to determine if a reservation exists. In addition,
				165	alloc_huge_page takes the argument avoid_reserve which indicates reserves
				166	should not be used even if it appears they have been set aside for the
				167	specified address. The avoid_reserve argument is most often used in the case
				168	of Copy on Write and Page Migration where additional copies of an existing
				169	page are being allocated.
				170
				171	The helper routine vma_needs_reservation() is called to determine if a
				172	reservation exists for the address within the mapping(vma). See the section
				173	"Reservation Map Helper Routines" for detailed information on what this
				174	routine does. The value returned from vma_needs_reservation() is generally
				175	0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists.
				176	If a reservation does not exist, and there is a subpool associated with the
				177	mapping the subpool is consulted to determine if it contains reservations.
				178	If the subpool contains reservations, one can be used for this allocation.
				179	However, in every case the avoid_reserve argument overrides the use of
				180	a reservation for the allocation. After determining whether a reservation
				181	exists and can be used for the allocation, the routine dequeue_huge_page_vma()
				182	is called. This routine takes two arguments related to reservations:
				183	- avoid_reserve, this is the same value/argument passed to alloc_huge_page()
				184	- chg, even though this argument is of type long only the values 0 or 1 are
				185	passed to dequeue_huge_page_vma. If the value is 0, it indicates a
				186	reservation exists (see the section "Reservations and Memory Policy" for
				187	possible issues). If the value is 1, it indicates a reservation does not
				188	exist and the page must be taken from the global free pool if possible.
				189	The free lists associated with the memory policy of the VMA are searched for
				190	a free page. If a page is found, the value free_huge_pages is decremented
				191	when the page is removed from the free list. If there was a reservation
				192	associated with the page, the following adjustments are made:
				193	SetPagePrivate(page); /* Indicates allocating this page consumed
				194	* a reservation, and if an error is
				195	* encountered such that the page must be
				196	* freed, the reservation will be restored. */
				197	resv_huge_pages--; /* Decrement the global reservation count */
				198	Note, if no huge page can be found that satisfies the VMA's memory policy
				199	an attempt will be made to allocate one using the buddy allocator. This
				200	brings up the issue of surplus huge pages and overcommit which is beyond
				201	the scope of reservations. Even if a surplus page is allocated, the same
				202	reservation based adjustments as above will be made: SetPagePrivate(page) and
				203	resv_huge_pages--.
				204
				205	After obtaining a new huge page, (page)->private is set to the value of
				206	the subpool associated with the page if it exists. This will be used for
				207	subpool accounting when the page is freed.
				208
				209	The routine vma_commit_reservation() is then called to adjust the reserve
				210	map based on the consumption of the reservation. In general, this involves
				211	ensuring the page is represented within a file_region structure of the reserve
				212	map. For shared mappings where the reservation was present, an entry
				213	in the reserve map already existed so no change is made. However, if there
				214	was no reservation in a shared mapping or this was a private mapping a new
				215	entry must be created.
				216
				217	It is possible that the reserve map could have been changed between the call
				218	to vma_needs_reservation() at the beginning of alloc_huge_page() and the
				219	call to vma_commit_reservation() after the page was allocated. This would
				220	be possible if hugetlb_reserve_pages was called for the same page in a shared
				221	mapping. In such cases, the reservation count and subpool free page count
				222	will be off by one. This rare condition can be identified by comparing the
				223	return value from vma_needs_reservation and vma_commit_reservation. If such
				224	a race is detected, the subpool and global reserve counts are adjusted to
				225	compensate. See the section "Reservation Map Helper Routines" for more
				226	information on these routines.
				227
				228
				229	Instantiate Huge Pages
				230	----------------------
				231	After huge page allocation, the page is typically added to the page tables
				232	of the allocating task. Before this, pages in a shared mapping are added
				233	to the page cache and pages in private mappings are added to an anonymous
				234	reverse mapping. In both cases, the PagePrivate flag is cleared. Therefore,
				235	when a huge page that has been instantiated is freed no adjustment is made
				236	to the global reservation count (resv_huge_pages).
				237
				238
				239	Freeing Huge Pages
				240	------------------
				241	Huge page freeing is performed by the routine free_huge_page(). This routine
				242	is the destructor for hugetlbfs compound pages. As a result, it is only
				243	passed a pointer to the page struct. When a huge page is freed, reservation
				244	accounting may need to be performed. This would be the case if the page was
				245	associated with a subpool that contained reserves, or the page is being freed
				246	on an error path where a global reserve count must be restored.
				247
				248	The page->private field points to any subpool associated with the page.
				249	If the PagePrivate flag is set, it indicates the global reserve count should
				250	be adjusted (see the section "Consuming Reservations/Allocating a Huge Page"
				251	for information on how these are set).
				252
				253	The routine first calls hugepage_subpool_put_pages() for the page. If this
				254	routine returns a value of 0 (which does not equal the value passed 1) it
				255	indicates reserves are associated with the subpool, and this newly free page
				256	must be used to keep the number of subpool reserves above the minimum size.
				257	Therefore, the global resv_huge_pages counter is incremented in this case.
				258
				259	If the PagePrivate flag was set in the page, the global resv_huge_pages counter
				260	will always be incremented.
				261
				262
				263	Subpool Reservations
				264	--------------------
				265	There is a struct hstate associated with each huge page size. The hstate
				266	tracks all huge pages of the specified size. A subpool represents a subset
				267	of pages within a hstate that is associated with a mounted hugetlbfs
				268	filesystem.
				269
				270	When a hugetlbfs filesystem is mounted a min_size option can be specified
				271	which indicates the minimum number of huge pages required by the filesystem.
				272	If this option is specified, the number of huge pages corresponding to
				273	min_size are reserved for use by the filesystem. This number is tracked in
				274	the min_hpages field of a struct hugepage_subpool. At mount time,
				275	hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
				276	huge pages. If they can not be reserved, the mount fails.
				277
				278	The routines hugepage_subpool_get/put_pages() are called when pages are
				279	obtained from or released back to a subpool. They perform all subpool
				280	accounting, and track any reservations associated with the subpool.
				281	hugepage_subpool_get/put_pages are passed the number of huge pages by which
				282	to adjust the subpool 'used page' count (up for get, down for put). Normally,
				283	they return the same value that was passed or an error if not enough pages
				284	exist in the subpool.
				285
				286	However, if reserves are associated with the subpool a return value less
				287	than the passed value may be returned. This return value indicates the
				288	number of additional global pool adjustments which must be made. For example,
				289	suppose a subpool contains 3 reserved huge pages and someone asks for 5.
				290	The 3 reserved pages associated with the subpool can be used to satisfy part
				291	of the request. But, 2 pages must be obtained from the global pools. To
				292	relay this information to the caller, the value 2 is returned. The caller
				293	is then responsible for attempting to obtain the additional two pages from
				294	the global pools.
				295
				296
				297	COW and Reservations
				298	--------------------
				299	Since shared mappings all point to and use the same underlying pages, the
				300	biggest reservation concern for COW is private mappings. In this case,
				301	two tasks can be pointing at the same previously allocated page. One task
				302	attempts to write to the page, so a new page must be allocated so that each
				303	task points to its own page.
				304
				305	When the page was originally allocated, the reservation for that page was
				306	consumed. When an attempt to allocate a new page is made as a result of
				307	COW, it is possible that no huge pages are free and the allocation
				308	will fail.
				309
				310	When the private mapping was originally created, the owner of the mapping
				311	was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation
				312	map of the owner. Since the owner created the mapping, the owner owns all
				313	the reservations associated with the mapping. Therefore, when a write fault
				314	occurs and there is no page available, different action is taken for the owner
				315	and non-owner of the reservation.
				316
				317	In the case where the faulting task is not the owner, the fault will fail and
				318	the task will typically receive a SIGBUS.
				319
				320	If the owner is the faulting task, we want it to succeed since it owned the
				321	original reservation. To accomplish this, the page is unmapped from the
				322	non-owning task. In this way, the only reference is from the owning task.
				323	In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
				324	of the non-owning task. The non-owning task may receive a SIGBUS if it later
				325	faults on a non-present page. But, the original owner of the
				326	mapping/reservation will behave as expected.
				327
				328
				329	Reservation Map Modifications
				330	-----------------------------
				331	The following low level routines are used to make modifications to a
				332	reservation map. Typically, these routines are not called directly. Rather,
				333	a reservation map helper routine is called which calls one of these low level
				334	routines. These low level routines are fairly well documented in the source
				335	code (mm/hugetlb.c). These routines are:
				336	long region_chg(struct resv_map *resv, long f, long t);
				337	long region_add(struct resv_map *resv, long f, long t);
				338	void region_abort(struct resv_map *resv, long f, long t);
				339	long region_count(struct resv_map *resv, long f, long t);
				340
				341	Operations on the reservation map typically involve two operations:
				342	1) region_chg() is called to examine the reserve map and determine how
				343	many pages in the specified range [f, t) are NOT currently represented.
				344
				345	The calling code performs global checks and allocations to determine if
				346	there are enough huge pages for the operation to succeed.
				347
				348	2a) If the operation can succeed, region_add() is called to actually modify
				349	the reservation map for the same range [f, t) previously passed to
				350	region_chg().
				351	2b) If the operation can not succeed, region_abort is called for the same range
				352	[f, t) to abort the operation.
				353
				354	Note that this is a two step process where region_add() and region_abort()
				355	are guaranteed to succeed after a prior call to region_chg() for the same
				356	range. region_chg() is responsible for pre-allocating any data structures
				357	necessary to ensure the subsequent operations (specifically region_add())
				358	will succeed.
				359
				360	As mentioned above, region_chg() determines the number of pages in the range
				361	which are NOT currently represented in the map. This number is returned to
				362	the caller. region_add() returns the number of pages in the range added to
				363	the map. In most cases, the return value of region_add() is the same as the
				364	return value of region_chg(). However, in the case of shared mappings it is
				365	possible for changes to the reservation map to be made between the calls to
				366	region_chg() and region_add(). In this case, the return value of region_add()
				367	will not match the return value of region_chg(). It is likely that in such
				368	cases global counts and subpool accounting will be incorrect and in need of
				369	adjustment. It is the responsibility of the caller to check for this condition
				370	and make the appropriate adjustments.
				371
				372	The routine region_del() is called to remove regions from a reservation map.
				373	It is typically called in the following situations:
				374	- When a file in the hugetlbfs filesystem is being removed, the inode will
				375	be released and the reservation map freed. Before freeing the reservation
				376	map, all the individual file_region structures must be freed. In this case
				377	region_del is passed the range [0, LONG_MAX).
				378	- When a hugetlbfs file is being truncated. In this case, all allocated pages
				379	after the new file size must be freed. In addition, any file_region entries
				380	in the reservation map past the new end of file must be deleted. In this
				381	case, region_del is passed the range [new_end_of_file, LONG_MAX).
				382	- When a hole is being punched in a hugetlbfs file. In this case, huge pages
				383	are removed from the middle of the file one at a time. As the pages are
				384	removed, region_del() is called to remove the corresponding entry from the
				385	reservation map. In this case, region_del is passed the range
				386	[page_idx, page_idx + 1).
				387	In every case, region_del() will return the number of pages removed from the
				388	reservation map. In VERY rare cases, region_del() can fail. This can only
				389	happen in the hole punch case where it has to split an existing file_region
				390	entry and can not allocate a new structure. In this error case, region_del()
				391	will return -ENOMEM. The problem here is that the reservation map will
				392	indicate that there is a reservation for the page. However, the subpool and
				393	global reservation counts will not reflect the reservation. To handle this
				394	situation, the routine hugetlb_fix_reserve_counts() is called to adjust the
				395	counters so that they correspond with the reservation map entry that could
				396	not be deleted.
				397
				398	region_count() is called when unmapping a private huge page mapping. In
				399	private mappings, the lack of an entry in the reservation map indicates that
				400	a reservation exists. Therefore, by counting the number of entries in the
				401	reservation map we know how many reservations were consumed and how many are
				402	outstanding (outstanding = (end - start) - region_count(resv, start, end)).
				403	Since the mapping is going away, the subpool and global reservation counts
				404	are decremented by the number of outstanding reservations.
				405
				406
				407	Reservation Map Helper Routines
				408	-------------------------------
				409	Several helper routines exist to query and modify the reservation maps.
				410	These routines are only interested in reservations for a specific huge
				411	page, so they just pass in an address instead of a range. In addition,
				412	they pass in the associated VMA. From the VMA, the type of mapping (private
				413	or shared) and the location of the reservation map (inode or VMA) can be
				414	determined. These routines simply call the underlying routines described
				415	in the section "Reservation Map Modifications". However, they do take into
				416	account the 'opposite' meaning of reservation map entries for private and
				417	shared mappings and hide this detail from the caller.
				418
				419	long vma_needs_reservation(struct hstate *h,
				420	struct vm_area_struct *vma, unsigned long addr)
				421	This routine calls region_chg() for the specified page. If no reservation
				422	exists, 1 is returned. If a reservation exists, 0 is returned.
				423
				424	long vma_commit_reservation(struct hstate *h,
				425	struct vm_area_struct *vma, unsigned long addr)
				426	This calls region_add() for the specified page. As in the case of region_chg
				427	and region_add, this routine is to be called after a previous call to
				428	vma_needs_reservation. It will add a reservation entry for the page. It
				429	returns 1 if the reservation was added and 0 if not. The return value should
				430	be compared with the return value of the previous call to
				431	vma_needs_reservation. An unexpected difference indicates the reservation
				432	map was modified between calls.
				433
				434	void vma_end_reservation(struct hstate *h,
				435	struct vm_area_struct *vma, unsigned long addr)
				436	This calls region_abort() for the specified page. As in the case of region_chg
				437	and region_abort, this routine is to be called after a previous call to
				438	vma_needs_reservation. It will abort/end the in progress reservation add
				439	operation.
				440
				441	long vma_add_reservation(struct hstate *h,
				442	struct vm_area_struct *vma, unsigned long addr)
				443	This is a special wrapper routine to help facilitate reservation cleanup
				444	on error paths. It is only called from the routine restore_reserve_on_error().
				445	This routine is used in conjunction with vma_needs_reservation in an attempt
				446	to add a reservation to the reservation map. It takes into account the
				447	different reservation map semantics for private and shared mappings. Hence,
				448	region_add is called for shared mappings (as an entry present in the map
				449	indicates a reservation), and region_del is called for private mappings (as
				450	the absence of an entry in the map indicates a reservation). See the section
				451	"Reservation Cleanup in Error Paths" for more information on what needs to
				452	be done on error paths.
				453
				454
				455	Reservation Cleanup in Error Paths
				456	----------------------------------
				457	As mentioned in the section "Reservation Map Helper Routines", reservation
				458	map modifications are performed in two steps. First vma_needs_reservation
				459	is called before a page is allocated. If the allocation is successful,
				460	then vma_commit_reservation is called. If not, vma_end_reservation is called.
				461	Global and subpool reservation counts are adjusted based on success or failure
				462	of the operation and all is well.
				463
				464	Additionally, after a huge page is instantiated the PagePrivate flag is
				465	cleared so that accounting when the page is ultimately freed is correct.
				466
				467	However, there are several instances where errors are encountered after a huge
				468	page is allocated but before it is instantiated. In this case, the page
				469	allocation has consumed the reservation and made the appropriate subpool,
				470	reservation map and global count adjustments. If the page is freed at this
				471	time (before instantiation and clearing of PagePrivate), then free_huge_page
				472	will increment the global reservation count. However, the reservation map
				473	indicates the reservation was consumed. This resulting inconsistent state
				474	will cause the 'leak' of a reserved huge page. The global reserve count will
				475	be higher than it should and prevent allocation of a pre-allocated page.
				476
				477	The routine restore_reserve_on_error() attempts to handle this situation. It
				478	is fairly well documented. The intention of this routine is to restore
				479	the reservation map to the way it was before the page allocation. In this
				480	way, the state of the reservation map will correspond to the global reservation
				481	count after the page is freed.
				482
				483	The routine restore_reserve_on_error itself may encounter errors while
				484	attempting to restore the reservation map entry. In this case, it will
				485	simply clear the PagePrivate flag of the page. In this way, the global
				486	reserve count will not be incremented when the page is freed. However, the
				487	reservation map will continue to look as though the reservation was consumed.
				488	A page can still be allocated for the address, but it will not use a reserved
				489	page as originally intended.
				490
				491	There is some code (most notably userfaultfd) which can not call
				492	restore_reserve_on_error. In this case, it simply modifies the PagePrivate flag
				493	so that a reservation will not be leaked when the huge page is freed.
				494
				495
				496	Reservations and Memory Policy
				497	------------------------------
				498	Per-node huge page lists existed in struct hstate when git was first used
				499	to manage Linux code. The concept of reservations was added some time later.
				500	When reservations were added, no attempt was made to take memory policy
				501	into account. While cpusets are not exactly the same as memory policy, this
				502	comment in hugetlb_acct_memory sums up the interaction between reservations
				503	and cpusets/memory policy.
				504	/*
				505	* When cpuset is configured, it breaks the strict hugetlb page
				506	* reservation as the accounting is done on a global variable. Such
				507	* reservation is completely rubbish in the presence of cpuset because
				508	* the reservation is not checked against page availability for the
				509	* current cpuset. Application can still potentially OOM'ed by kernel
				510	* with lack of free htlb page in cpuset that the task is in.
				511	* Attempt to enforce strict accounting with cpuset is almost
				512	* impossible (or too ugly) because cpuset is too fluid that
				513	* task or memory node can be dynamically moved between cpusets.
				514	*
				515	* The change of semantics for shared hugetlb mapping with cpuset is
				516	* undesirable. However, in order to preserve some of the semantics,
				517	* we fall back to check against current free page availability as
				518	* a best attempt and hopefully to minimize the impact of changing
				519	* semantics that cpuset has.
				520	*/
				521
				522	Huge page reservations were added to prevent unexpected page allocation
				523	failures (OOM) at page fault time. However, if an application makes use
				524	of cpusets or memory policy there is no guarantee that huge pages will be
				525	available on the required nodes. This is true even if there are a sufficient
				526	number of global reservations.
				527
				528
				529	Mike Kravetz, 7 April 2017