| ====================== |
| Kernel Self-Protection |
| ====================== |
| |
| Kernel self-protection is the design and implementation of systems and |
| structures within the Linux kernel to protect against security flaws in |
| the kernel itself. This covers a wide range of issues, including removing |
| entire classes of bugs, blocking security flaw exploitation methods, |
| and actively detecting attack attempts. Not all topics are explored in |
| this document, but it should serve as a reasonable starting point and |
| answer any frequently asked questions. (Patches welcome, of course!) |
| |
| In the worst-case scenario, we assume an unprivileged local attacker |
| has arbitrary read and write access to the kernel's memory. In many |
| cases, bugs being exploited will not provide this level of access, |
| but with systems in place that defend against the worst case we'll |
| cover the more limited cases as well. A higher bar, and one that should |
| still be kept in mind, is protecting the kernel against a _privileged_ |
| local attacker, since the root user has access to a vastly increased |
| attack surface. (Especially when they have the ability to load arbitrary |
| kernel modules.) |
| |
| The goals for successful self-protection systems would be that they |
| are effective, on by default, require no opt-in by developers, have no |
| performance impact, do not impede kernel debugging, and have tests. It |
| is uncommon that all these goals can be met, but it is worth explicitly |
| mentioning them, since these aspects need to be explored, dealt with, |
| and/or accepted. |
| |
| |
| Attack Surface Reduction |
| ======================== |
| |
| The most fundamental defense against security exploits is to reduce the |
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, to making in-kernel
APIs hard to use incorrectly, to minimizing the areas of writable kernel
memory.
| |
| Strict kernel memory permissions |
| -------------------------------- |
| |
When all of kernel memory is writable, it becomes trivial for attackers
to redirect execution flow. To reduce the availability of these targets,
the kernel needs to protect its memory with a tight set of permissions.
| |
| Executable code and read-only data must not be writable |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Any areas of the kernel with executable memory must not be writable. |
| While this obviously includes the kernel text itself, we must consider |
| all additional places too: kernel modules, JIT memory, etc. (There are |
| temporary exceptions to this rule to support things like instruction |
| alternatives, breakpoints, kprobes, etc. If these must exist in a |
| kernel, they are implemented in a way where the memory is temporarily |
| made writable during the update, and then returned to the original |
| permissions.) |
| |
| In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and |
| ``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not |
| writable, data is not executable, and read-only data is neither writable |
| nor executable. |
| |
Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ``ARCH_OPTIONAL_KERNEL_RWX`` to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.
| |
| Function pointers and sensitive variables must not be writable |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Vast areas of kernel memory contain function pointers that are looked |
| up by the kernel and used to continue execution (e.g. descriptor/vector |
| tables, file/network/etc operation structures, etc). The number of these |
| variables must be reduced to an absolute minimum. |
| |
| Many such variables can be made read-only by setting them "const" |
| so that they live in the .rodata section instead of the .data section |
| of the kernel, gaining the protection of the kernel's strict memory |
| permissions as described above. |
| |
Variables that are initialized once at ``__init`` time can be marked
with the (new and under development) ``__ro_after_init`` attribute.
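
As an illustration (the structure and variable names below are
hypothetical), a table of function pointers can be declared ``const``
so it is placed in .rodata, and a variable written only during boot
can be marked ``__ro_after_init``::

  #include <linux/cache.h>
  #include <linux/init.h>

  /* "const" places this ops table in .rodata. */
  struct example_ops {
          int (*open)(void);
          void (*close)(void);
  };

  static int example_open(void) { return 0; }
  static void example_close(void) { }

  static const struct example_ops example_default_ops = {
          .open  = example_open,
          .close = example_close,
  };

  /* Written exactly once during boot, read-only afterwards. */
  static unsigned long example_feature_mask __ro_after_init;

  static int __init example_setup(void)
  {
          example_feature_mask = 0x3;     /* last legitimate write */
          return 0;
  }
  core_initcall(example_setup);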
| |
| What remains are variables that are updated rarely (e.g. GDT). These |
| will need another infrastructure (similar to the temporary exceptions |
| made to kernel code mentioned above) that allow them to spend the rest |
| of their lifetime read-only. (For example, when being updated, only the |
| CPU thread performing the update would be given uninterruptible write |
| access to the memory.) |
| |
| Segregation of kernel memory from userspace memory |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The kernel must never execute userspace memory. The kernel must also never |
| access userspace memory without explicit expectation to do so. These |
| rules can be enforced either by support of hardware-based restrictions |
| (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains). |
| By blocking userspace memory in this way, execution and data parsing |
| cannot be passed to trivially-controlled userspace memory, forcing |
| attacks to operate entirely in kernel memory. |
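
Accessing userspace memory "with explicit expectation" means going
through the uaccess helpers rather than dereferencing user pointers
directly; a minimal sketch, with a hypothetical ioctl handler and
structure::

  #include <linux/errno.h>
  #include <linux/types.h>
  #include <linux/uaccess.h>

  struct example_req {
          u32 flags;
          u64 addr;
  };

  static long example_ioctl_handler(void __user *argp)
  {
          struct example_req req;

          /*
           * copy_from_user() is the explicit, checked path into user
           * memory; with SMAP (x86) or PAN (arm) enabled, dereferencing
           * a userspace pointer directly from kernel code faults.
           */
          if (copy_from_user(&req, argp, sizeof(req)))
                  return -EFAULT;

          /* ... validate and act on req ... */
          return 0;
  }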
| |
| Reduced access to syscalls |
| -------------------------- |
| |
| One trivial way to eliminate many syscalls for 64-bit systems is building |
| without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario. |
| |
| The "seccomp" system provides an opt-in feature made available to |
| userspace, which provides a way to reduce the number of kernel entry |
| points available to a running process. This limits the breadth of kernel |
| code that can be reached, possibly reducing the availability of a given |
| bug to an attack. |
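
As a minimal userspace sketch, a process can opt into seccomp's strict
mode, which permits only ``read()``, ``write()``, ``_exit()``, and
``sigreturn()``; filter mode (``SECCOMP_MODE_FILTER``) allows
finer-grained BPF policies::

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/seccomp.h>

  int main(void)
  {
          /*
           * Strict mode allows only read(), write(), _exit() and
           * sigreturn(); any other syscall terminates the process.
           */
          if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)) {
                  perror("prctl");
                  return 1;
          }

          write(1, "restricted\n", 11);

          /* Use the raw exit syscall: glibc's exit() would call
           * exit_group(), which strict mode does not allow. */
          syscall(__NR_exit, 0);
  }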
| |
An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.
| |
| Restricting access to kernel modules |
| ------------------------------------ |
| |
| The kernel should never allow an unprivileged user the ability to |
| load specific kernel modules, since that would provide a facility to |
| unexpectedly extend the available attack surface. (The on-demand loading |
| of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is |
| considered "expected" here, though additional consideration should be |
| given even to these.) For example, loading a filesystem module via an |
| unprivileged socket API is nonsense: only the root or physically local |
| user should trigger filesystem module loading. (And even this can be up |
| for debate in some scenarios.) |
| |
| To protect against even privileged users, systems may need to either |
| disable module loading entirely (e.g. monolithic kernel builds or |
| modules_disabled sysctl), or provide signed modules (e.g. |
| ``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having |
| root load arbitrary kernel code via the module loader interface. |
| |
| |
| Memory integrity |
| ================ |
| |
There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is the stack buffer overflow, in which the return address
stored on the stack is overwritten. Many other examples of this kind of
attack exist, and protections exist to defend against them.
| |
| Stack buffer overflow |
| --------------------- |
| |
| The classic stack buffer overflow involves writing past the expected end |
| of a variable stored on the stack, ultimately writing a controlled value |
| to the stack frame's stored return address. The most widely used defense |
| is the presence of a stack canary between the stack variables and the |
| return address (``CONFIG_CC_STACKPROTECTOR``), which is verified just before |
| the function returns. Other defenses include things like shadow stacks. |
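
As an illustration of the bug class (the function below is
hypothetical), an unchecked copy into a fixed-size stack buffer can
reach past the canary toward the saved return address::

  #include <linux/string.h>
  #include <linux/types.h>

  static void example_parse(const char *src, size_t len)
  {
          char buf[16];

          /*
           * BUG if len > sizeof(buf): the copy runs past the end of
           * buf, over the canary placed by CONFIG_CC_STACKPROTECTOR,
           * and toward the stored return address. The canary check at
           * function exit detects the corruption before the clobbered
           * return address can be used.
           */
          memcpy(buf, src, len);

          /* ... parse buf ... */
  }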
| |
| Stack depth overflow |
| -------------------- |
| |
| A less well understood attack is using a bug that triggers the |
| kernel to consume stack memory with deep function calls or large stack |
| allocations. With this attack it is possible to write beyond the end of |
| the kernel's preallocated stack space and into sensitive structures. Two |
| important changes need to be made for better protections: moving the |
| sensitive thread_info structure elsewhere, and adding a faulting memory |
| hole at the bottom of the stack to catch these overflows. |
| |
| Heap memory integrity |
| --------------------- |
| |
| The structures used to track heap free lists can be sanity-checked during |
| allocation and freeing to make sure they aren't being used to manipulate |
| other memory areas. |
| |
| Counter integrity |
| ----------------- |
| |
| Many places in the kernel use atomic counters to track object references |
| or perform similar lifetime management. When these counters can be made |
| to wrap (over or under) this traditionally exposes a use-after-free |
| flaw. By trapping atomic wrapping, this class of bug vanishes. |
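
The kernel's ``refcount_t`` API provides this behavior by saturating
instead of wrapping; a minimal sketch with a hypothetical object::

  #include <linux/refcount.h>
  #include <linux/slab.h>

  struct example_obj {
          refcount_t ref;
          /* ... */
  };

  static struct example_obj *example_alloc(void)
  {
          struct example_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

          if (obj)
                  refcount_set(&obj->ref, 1);
          return obj;
  }

  static void example_get(struct example_obj *obj)
  {
          /*
           * refcount_inc() saturates and warns rather than wrapping,
           * so forcing extra increments cannot wrap the counter
           * around and trigger a premature free.
           */
          refcount_inc(&obj->ref);
  }

  static void example_put(struct example_obj *obj)
  {
          if (refcount_dec_and_test(&obj->ref))
                  kfree(obj);
  }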
| |
| Size calculation overflow detection |
| ----------------------------------- |
| |
| Similar to counter overflow, integer overflows (usually size calculations) |
| need to be detected at runtime to kill this class of bug, which |
| traditionally leads to being able to write past the end of kernel buffers. |
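
The helpers in ``<linux/overflow.h>`` (e.g. ``check_add_overflow()``,
``check_mul_overflow()``, ``struct_size()``) perform such checked
arithmetic; a minimal sketch with a hypothetical structure::

  #include <linux/overflow.h>
  #include <linux/slab.h>
  #include <linux/types.h>

  struct example_array {
          size_t count;
          u32 items[];
  };

  static struct example_array *example_alloc(size_t count)
  {
          struct example_array *arr;

          /*
           * struct_size() computes sizeof(*arr) + count * sizeof(u32)
           * with overflow-checked arithmetic; on overflow it saturates
           * to SIZE_MAX so the allocation fails instead of returning
           * an undersized buffer.
           */
          arr = kzalloc(struct_size(arr, items, count), GFP_KERNEL);
          if (!arr)
                  return NULL;

          arr->count = count;
          return arr;
  }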
| |
| |
| Probabilistic defenses |
| ====================== |
| |
| While many protections can be considered deterministic (e.g. read-only |
| memory cannot be written to), some protections provide only statistical |
| defense, in that an attack must gather enough information about a |
| running system to overcome the defense. While not perfect, these do |
| provide meaningful defenses. |
| |
| Canaries, blinding, and other secrets |
| ------------------------------------- |
| |
| It should be noted that things like the stack canary discussed earlier |
| are technically statistical defenses, since they rely on a secret value, |
| and such values may become discoverable through an information exposure |
| flaw. |
| |
| Blinding literal values for things like JITs, where the executable |
| contents may be partially under the control of userspace, need a similar |
| secret value. |
| |
It is critical that the secret values used are separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.
| |
| Kernel Address Space Layout Randomization (KASLR) |
| ------------------------------------------------- |
| |
| Since the location of kernel memory is almost always instrumental in |
| mounting a successful attack, making the location non-deterministic |
| raises the difficulty of an exploit. (Note that this in turn makes |
| the value of information exposures higher, since they may be used to |
| discover desired memory locations.) |
| |
| Text and module base |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| By relocating the physical and virtual base address of the kernel at |
| boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be |
| frustrated. Additionally, offsetting the module loading base address |
| means that even systems that load the same set of modules in the same |
| order every boot will not share a common base address with the rest of |
| the kernel text. |
| |
| Stack base |
| ~~~~~~~~~~ |
| |
| If the base address of the kernel stack is not the same between processes, |
| or even not the same between syscalls, targets on or beyond the stack |
| become more difficult to locate. |
| |
| Dynamic memory base |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up |
| being relatively deterministic in layout due to the order of early-boot |
| initializations. If the base address of these areas is not the same |
| between boots, targeting them is frustrated, requiring an information |
| exposure specific to the region. |
| |
| Structure layout |
| ~~~~~~~~~~~~~~~~ |
| |
| By performing a per-build randomization of the layout of sensitive |
| structures, attacks must either be tuned to known kernel builds or expose |
| enough kernel memory to determine structure layouts before manipulating |
| them. |
| |
| |
| Preventing Information Exposures |
| ================================ |
| |
| Since the locations of sensitive structures are the primary target for |
| attacks, it is important to defend against exposure of both kernel memory |
| addresses and kernel memory contents (since they may contain kernel |
| addresses or other sensitive things like canary values). |
| |
| Unique identifiers |
| ------------------ |
| |
| Kernel memory addresses must never be used as identifiers exposed to |
| userspace. Instead, use an atomic counter, an idr, or similar unique |
| identifier. |
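
For example, the IDR/IDA family can hand out small integer identifiers
that reveal nothing about kernel memory layout; a sketch using an IDA
(names hypothetical)::

  #include <linux/gfp.h>
  #include <linux/idr.h>

  static DEFINE_IDA(example_ida);

  /* Returns a small integer ID to expose to userspace instead of a
   * kernel pointer, or a negative errno on failure. */
  static int example_register(void)
  {
          return ida_alloc(&example_ida, GFP_KERNEL);
  }

  static void example_unregister(int id)
  {
          ida_free(&example_ida, id);
  }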
| |
| Memory initialization |
| --------------------- |
| |
| Memory copied to userspace must always be fully initialized. If not |
| explicitly memset(), this will require changes to the compiler to make |
| sure structure holes are cleared. |
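
For example (the structure and handler are hypothetical), zeroing the
whole structure before filling it ensures padding holes do not carry
stale stack data to userspace::

  #include <linux/errno.h>
  #include <linux/string.h>
  #include <linux/types.h>
  #include <linux/uaccess.h>

  struct example_info {
          u32 id;
          u8  state;
          /* compiler-inserted padding typically lands here */
          u64 size;
  };

  static long example_fill(void __user *argp)
  {
          struct example_info info;

          /*
           * memset() clears the padding holes that member-by-member
           * assignment (or a designated initializer) may leave
           * untouched, so no stale stack bytes reach userspace.
           */
          memset(&info, 0, sizeof(info));
          info.id = 1;
          info.state = 2;
          info.size = 4096;

          if (copy_to_user(argp, &info, sizeof(info)))
                  return -EFAULT;
          return 0;
  }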
| |
| Memory poisoning |
| ---------------- |
| |
| When releasing memory, it is best to poison the contents (clear stack on |
| syscall return, wipe heap memory on a free), to avoid reuse attacks that |
| rely on the old contents of memory. This frustrates many uninitialized |
| variable attacks, stack content exposures, heap content exposures, and |
| use-after-free attacks. |
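
As one example, callers holding especially sensitive heap data can use
``kfree_sensitive()``, which wipes the buffer before returning it to
the allocator (the function below is hypothetical)::

  #include <linux/slab.h>

  static void example_destroy_key(void *key_material)
  {
          /*
           * kfree_sensitive() zeroes the allocation before handing it
           * back to the allocator, so a later exposure or
           * use-after-free of that memory cannot recover the key.
           */
          kfree_sensitive(key_material);
  }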
| |
| Destination tracking |
| -------------------- |
| |
| To help kill classes of bugs that result in kernel addresses being |
| written to userspace, the destination of writes needs to be tracked. If |
| the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files), |
| it should automatically censor sensitive values. |
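
While this is not automatic destination tracking, the ``%pK`` format
specifier illustrates the kind of censoring intended: it hides pointer
values from unprivileged readers, subject to the ``kptr_restrict``
sysctl (the show function below is hypothetical)::

  #include <linux/seq_file.h>

  static int example_show(struct seq_file *m, void *v)
  {
          /*
           * %pK honours the kptr_restrict sysctl: unprivileged readers
           * see a censored value instead of a real kernel address.
           */
          seq_printf(m, "object at %pK\n", v);
          return 0;
  }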