linux/include/asm-generic/vmlinux.lds.h

1181 lines
34 KiB
C
Raw Permalink Normal View History

/*
* Helper macros to support writing architecture specific
* linker scripts.
*
* A minimal linker scripts has following content:
* [This is a sample, architectures may have special requirements]
*
* OUTPUT_FORMAT(...)
* OUTPUT_ARCH(...)
* ENTRY(...)
* SECTIONS
* {
* . = START;
* __init_begin = .;
* HEAD_TEXT_SECTION
* INIT_TEXT_SECTION(PAGE_SIZE)
* INIT_DATA_SECTION(...)
* PERCPU_SECTION(CACHELINE_SIZE)
* __init_end = .;
*
* _stext = .;
* TEXT_SECTION = 0
* _etext = .;
*
* _sdata = .;
* RO_DATA(PAGE_SIZE)
* RW_DATA(...)
* _edata = .;
*
* EXCEPTION_TABLE(...)
*
vmlinux.lds.h: restructure BSS linker script macros The BSS section macros in vmlinux.lds.h currently place the .sbss input section outside the bounds of [__bss_start, __bss_end]. On all architectures except for microblaze that handle both .sbss and __bss_start/__bss_end, this is wrong: the .sbss input section is within the range [__bss_start, __bss_end]. Relatedly, the example code at the top of the file actually has __bss_start/__bss_end defined twice; I believe the right fix here is to define them in the BSS_SECTION macro but not in the BSS macro. Another problem with the current macros is that several architectures have an ALIGN(4) or some other small number just before __bss_stop in their linker scripts. The BSS_SECTION macro currently hardcodes this to 4; while it should really be an argument. It also ignores its sbss_align argument; fix that. mn10300 is the only user at present of any of the macros touched by this patch. It looks like mn10300 actually was incorrectly converted to use the new BSS() macro (the alignment of 4 prior to conversion was a __bss_stop alignment, but the argument to the BSS macro is a start alignment). So fix this as well. I'd like acks from Sam and David on this one. Also CCing Paul, since he has a patch from me which will need to be updated to use BSS_SECTION(0, PAGE_SIZE, 4) once this gets merged. Signed-off-by: Tim Abbott <tabbott@ksplice.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2009-07-12 22:23:33 +00:00
* BSS_SECTION(0, 0, 0)
* _end = .;
*
* STABS_DEBUG
* DWARF_DEBUG
* ELF_DETAILS
*
* DISCARDS // must be the last
* }
*
* [__init_begin, __init_end] is the init section that may be freed after init
* // __init_begin and __init_end should be page aligned, so that we can
* // free the whole .init memory
* [_stext, _etext] is the text section
* [_sdata, _edata] is the data section
*
* Some of the included output section have their own set of constants.
* Examples are: [__initramfs_start, __initramfs_end] for initramfs and
* [__nosave_begin, __nosave_end] for the nosave data
*/
lib: add allocation tagging support for memory allocation profiling Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily instrument memory allocators. It registers an "alloc_tags" codetag type with /proc/allocinfo interface to output allocation tag information when the feature is enabled. CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory allocation profiling instrumentation. Memory allocation profiling can be enabled or disabled at runtime using /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation profiling by default. [surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title] Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com [surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU] Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com [klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h] Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com [surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()] Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Klara Modin <klarasmodin@gmail.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-21 16:36:35 +00:00
#include <asm-generic/codetag.lds.h>
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
/*
* Only some architectures want to have the .notes segment visible in
* a separate PT_NOTE ELF Program Header. When this happens, it needs
* to be visible in both the kernel text's PT_LOAD and the PT_NOTE
* Program Headers. In this case, though, the PT_LOAD needs to be made
* the default again so that all the following sections don't also end
* up in the PT_NOTE Program Header.
*/
#ifdef EMITS_PT_NOTE
#define NOTES_HEADERS :text :note
#define NOTES_HEADERS_RESTORE __restore_ph : { *(.__restore_ph) } :text
#else
#define NOTES_HEADERS
#define NOTES_HEADERS_RESTORE
#endif
/*
* Some architectures have non-executable read-only exception tables.
* They can be added to the RO_DATA segment by specifying their desired
* alignment.
*/
#ifdef RO_EXCEPTION_TABLE_ALIGN
#define RO_EXCEPTION_TABLE EXCEPTION_TABLE(RO_EXCEPTION_TABLE_ALIGN)
#else
#define RO_EXCEPTION_TABLE
#endif
/* Align . function alignment. */
#define ALIGN_FUNCTION() . = ALIGN(CONFIG_FUNCTION_ALIGNMENT)
/*
* LD_DEAD_CODE_DATA_ELIMINATION option enables -fdata-sections, which
* generates .data.identifier sections, which need to be pulled in with
* .data. We don't want to pull in .data..other sections, which Linux
* has defined. Same for text and bss.
*
kbuild: add support for Clang LTO This change adds build system support for Clang's Link Time Optimization (LTO). With -flto, instead of ELF object files, Clang produces LLVM bitcode, which is compiled into native code at link time, allowing the final binary to be optimized globally. For more details, see: https://llvm.org/docs/LinkTimeOptimization.html The Kconfig option CONFIG_LTO_CLANG is implemented as a choice, which defaults to LTO being disabled. To use LTO, the architecture must select ARCH_SUPPORTS_LTO_CLANG and support: - compiling with Clang, - compiling all assembly code with Clang's integrated assembler, - and linking with LLD. While using CONFIG_LTO_CLANG_FULL results in the best runtime performance, the compilation is not scalable in time or memory. CONFIG_LTO_CLANG_THIN enables ThinLTO, which allows parallel optimization and faster incremental builds. ThinLTO is used by default if the architecture also selects ARCH_SUPPORTS_LTO_CLANG_THIN: https://clang.llvm.org/docs/ThinLTO.html To enable LTO, LLVM tools must be used to handle bitcode files, by passing LLVM=1 and LLVM_IAS=1 options to make: $ make LLVM=1 LLVM_IAS=1 defconfig $ scripts/config -e LTO_CLANG_THIN $ make LLVM=1 LLVM_IAS=1 To prepare for LTO support with other compilers, common parts are gated behind the CONFIG_LTO option, and LTO can be disabled for specific files by filtering out CC_FLAGS_LTO. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201211184633.3213045-3-samitolvanen@google.com
2020-12-11 18:46:19 +00:00
* With LTO_CLANG, the linker also splits sections by default, so we need
* these macros to combine the sections during the final link.
*
kbuild: Add Propeller configuration for kernel build Add the build support for using Clang's Propeller optimizer. Like AutoFDO, Propeller uses hardware sampling to gather information about the frequency of execution of different code paths within a binary. This information is then used to guide the compiler's optimization decisions, resulting in a more efficient binary. The support requires a Clang compiler LLVM 19 or later, and the create_llvm_prof tool (https://github.com/google/autofdo/releases/tag/v0.30.1). This commit is limited to x86 platforms that support PMU features like LBR on Intel machines and AMD Zen3 BRS. Here is an example workflow for building an AutoFDO+Propeller optimized kernel: 1) Build the kernel on the host machine, with AutoFDO and Propeller build config CONFIG_AUTOFDO_CLANG=y CONFIG_PROPELLER_CLANG=y then $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> “<autofdo_profile>” is the profile collected when doing a non-Propeller AutoFDO build. This step builds a kernel that has the same optimization level as AutoFDO, plus a metadata section that records basic block information. This kernel image runs as fast as an AutoFDO optimized kernel. 2) Install the kernel on test/production machines. 3) Run the load tests. The '-c' option in perf specifies the sample event period. We suggest using a suitable prime number, like 500009, for this purpose. For Intel platforms: $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c <count> \ -o <perf_file> -- <loadtest> For AMD platforms: The supported system are: Zen3 with BRS, or Zen4 with amd_lbr_v2 # To see if Zen3 support LBR: $ cat proc/cpuinfo | grep " brs" # To see if Zen4 support LBR: $ cat proc/cpuinfo | grep amd_lbr_v2 # If the result is yes, then collect the profile using: $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \ -N -b -c <count> -o <perf_file> -- <loadtest> 4) (Optional) Download the raw perf file to the host machine. 5) Generate Propeller profile: $ create_llvm_prof --binary=<vmlinux> --profile=<perf_file> \ --format=propeller --propeller_output_module_name \ --out=<propeller_profile_prefix>_cc_profile.txt \ --propeller_symorder=<propeller_profile_prefix>_ld_profile.txt “create_llvm_prof” is the profile conversion tool, and a prebuilt binary for linux can be found on https://github.com/google/autofdo/releases/tag/v0.30.1 (can also build from source). "<propeller_profile_prefix>" can be something like "/home/user/dir/any_string". This command generates a pair of Propeller profiles: "<propeller_profile_prefix>_cc_profile.txt" and "<propeller_profile_prefix>_ld_profile.txt". 6) Rebuild the kernel using the AutoFDO and Propeller profile files. CONFIG_AUTOFDO_CLANG=y CONFIG_PROPELLER_CLANG=y and $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> \ CLANG_PROPELLER_PROFILE_PREFIX=<propeller_profile_prefix> Co-developed-by: Han Shen <shenhan@google.com> Signed-off-by: Han Shen <shenhan@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Sriraman Tallam <tmsriram@google.com> Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Tested-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-02 17:51:14 +00:00
* With AUTOFDO_CLANG and PROPELLER_CLANG, by default, the linker splits
* text sections and regroups functions into subsections.
*
* RODATA_MAIN is not used because existing code already defines .rodata.x
* sections to be brought in with rodata.
*/
#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG) || \
kbuild: Add Propeller configuration for kernel build Add the build support for using Clang's Propeller optimizer. Like AutoFDO, Propeller uses hardware sampling to gather information about the frequency of execution of different code paths within a binary. This information is then used to guide the compiler's optimization decisions, resulting in a more efficient binary. The support requires a Clang compiler LLVM 19 or later, and the create_llvm_prof tool (https://github.com/google/autofdo/releases/tag/v0.30.1). This commit is limited to x86 platforms that support PMU features like LBR on Intel machines and AMD Zen3 BRS. Here is an example workflow for building an AutoFDO+Propeller optimized kernel: 1) Build the kernel on the host machine, with AutoFDO and Propeller build config CONFIG_AUTOFDO_CLANG=y CONFIG_PROPELLER_CLANG=y then $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> “<autofdo_profile>” is the profile collected when doing a non-Propeller AutoFDO build. This step builds a kernel that has the same optimization level as AutoFDO, plus a metadata section that records basic block information. This kernel image runs as fast as an AutoFDO optimized kernel. 2) Install the kernel on test/production machines. 3) Run the load tests. The '-c' option in perf specifies the sample event period. We suggest using a suitable prime number, like 500009, for this purpose. For Intel platforms: $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c <count> \ -o <perf_file> -- <loadtest> For AMD platforms: The supported system are: Zen3 with BRS, or Zen4 with amd_lbr_v2 # To see if Zen3 support LBR: $ cat proc/cpuinfo | grep " brs" # To see if Zen4 support LBR: $ cat proc/cpuinfo | grep amd_lbr_v2 # If the result is yes, then collect the profile using: $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \ -N -b -c <count> -o <perf_file> -- <loadtest> 4) (Optional) Download the raw perf file to the host machine. 5) Generate Propeller profile: $ create_llvm_prof --binary=<vmlinux> --profile=<perf_file> \ --format=propeller --propeller_output_module_name \ --out=<propeller_profile_prefix>_cc_profile.txt \ --propeller_symorder=<propeller_profile_prefix>_ld_profile.txt “create_llvm_prof” is the profile conversion tool, and a prebuilt binary for linux can be found on https://github.com/google/autofdo/releases/tag/v0.30.1 (can also build from source). "<propeller_profile_prefix>" can be something like "/home/user/dir/any_string". This command generates a pair of Propeller profiles: "<propeller_profile_prefix>_cc_profile.txt" and "<propeller_profile_prefix>_ld_profile.txt". 6) Rebuild the kernel using the AutoFDO and Propeller profile files. CONFIG_AUTOFDO_CLANG=y CONFIG_PROPELLER_CLANG=y and $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> \ CLANG_PROPELLER_PROFILE_PREFIX=<propeller_profile_prefix> Co-developed-by: Han Shen <shenhan@google.com> Signed-off-by: Han Shen <shenhan@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Sriraman Tallam <tmsriram@google.com> Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Suggested-by: Stephane Eranian <eranian@google.com> Tested-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-02 17:51:14 +00:00
defined(CONFIG_AUTOFDO_CLANG) || defined(CONFIG_PROPELLER_CLANG)
#define TEXT_MAIN .text .text.[0-9a-zA-Z_]*
#else
#define TEXT_MAIN .text
#endif
#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG)
vmlinux.lds.h: catch even more instrumentation symbols into .data LKP caught another bunch of orphaned instrumentation symbols [0]: mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/main.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/main.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/do_mounts.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/do_mounts.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/do_mounts_initrd.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/do_mounts_initrd.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/initramfs.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/initramfs.o' being placed in section `.data.$LPBX0' mipsel-linux-ld: warning: orphan section `.data.$LPBX1' from `init/calibrate.o' being placed in section `.data.$LPBX1' mipsel-linux-ld: warning: orphan section `.data.$LPBX0' from `init/calibrate.o' being placed in section `.data.$LPBX0' [...] Soften the wildcard to .data.$L* to grab these ones into .data too. [0] https://lore.kernel.org/lkml/202102231519.lWPLPveV-lkp@intel.com Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-23 11:36:21 +00:00
#define DATA_MAIN .data .data.[0-9a-zA-Z_]* .data..L* .data..compoundliteral* .data.$__unnamed_* .data.$L*
#define SDATA_MAIN .sdata .sdata.[0-9a-zA-Z_]*
#define RODATA_MAIN .rodata .rodata.[0-9a-zA-Z_]* .rodata..L*
#define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..L* .bss..compoundliteral*
#define SBSS_MAIN .sbss .sbss.[0-9a-zA-Z_]*
#else
#define DATA_MAIN .data
#define SDATA_MAIN .sdata
#define RODATA_MAIN .rodata
#define BSS_MAIN .bss
#define SBSS_MAIN .sbss
#endif
/*
* GCC 4.5 and later have a 32 bytes section alignment for structures.
* Except GCC 4.9, that feels the need to align on 64 bytes.
*/
#define STRUCT_ALIGNMENT 32
#define STRUCT_ALIGN() . = ALIGN(STRUCT_ALIGNMENT)
/*
* The order of the sched class addresses are important, as they are
* used to determine the order of the priority of each sched class in
* relation to each other.
*/
#define SCHED_DATA \
STRUCT_ALIGN(); \
__sched_class_highest = .; \
*(__stop_sched_class) \
*(__dl_sched_class) \
*(__rt_sched_class) \
*(__fair_sched_class) \
sched_ext: Implement BPF extensible scheduler class Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following: 1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. 2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments. sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the more simple and ergonomic struct sched_ext_ops callbacks. For more detailed discussion on the motivations and overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation. This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support. include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The followings are worth noting: - Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx". - In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided. - A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext. - To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq(). - sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks. - A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq(). - One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq. - Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn(). - When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths. v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed. - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free. - Consolidated per-cpu variable usages and other cleanups. v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group. - SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch. - ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext. - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change. - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param. - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ which makes tasks to execute in order while other CPUs stay idle which made some hackbench numbers really bad. Fixed. - The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future. - scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility. - exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs. - The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE. - Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files. - Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs. v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64. - Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields. - Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops. - Rename "type" to "kind" in scx_exit_info to make it easier to use on languages in which "type" is a reserved keyword. - Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE. - Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt(). v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is because upstream is adaopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly. - task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX. - scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask. - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores. - ops.enable() now sees up-to-date p->scx.weight value. - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call. - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler. v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight. - move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags. - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them. - The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations. v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_taks_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details. - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK(). - Unused all_dsqs list removed. This was a left-over from previous iterations. - p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe. - BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately. - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1. - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support. - Add MAINTAINERS entries and other misc changes. Signed-off-by: Tejun Heo <tj@kernel.org> Co-authored-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Andrea Righi <andrea.righi@canonical.com>
2024-06-18 20:09:17 +00:00
*(__ext_sched_class) \
*(__idle_sched_class) \
__sched_class_lowest = .;
/* The actual configuration determine if the init/exit sections
* are handled as text/data or they can be discarded (which
* often happens at runtime)
*/
#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
#define KEEP_PATCHABLE KEEP(*(__patchable_function_entries))
#define PATCHABLE_DISCARDS
#else
#define KEEP_PATCHABLE
#define PATCHABLE_DISCARDS *(__patchable_function_entries)
#endif
#ifndef CONFIG_ARCH_SUPPORTS_CFI_CLANG
/*
* Simply points to ftrace_stub, but with the proper protocol.
* Defined by the linker script in linux/vmlinux.lds.h
*/
#define FTRACE_STUB_HACK ftrace_stub_graph = ftrace_stub;
#else
#define FTRACE_STUB_HACK
#endif
ftrace: create __mcount_loc section This patch creates a section in the kernel called "__mcount_loc". This will hold a list of pointers to the mcount relocation for each call site of mcount. For example: objdump -dr init/main.o [...] Disassembly of section .text: 0000000000000000 <do_one_initcall>: 0: 55 push %rbp [...] 000000000000017b <init_post>: 17b: 55 push %rbp 17c: 48 89 e5 mov %rsp,%rbp 17f: 53 push %rbx 180: 48 83 ec 08 sub $0x8,%rsp 184: e8 00 00 00 00 callq 189 <init_post+0xe> 185: R_X86_64_PC32 mcount+0xfffffffffffffffc [...] We will add a section to point to each function call. .section __mcount_loc,"a",@progbits [...] .quad .text + 0x185 [...] The offset to of the mcount call site in init_post is an offset from the start of the section, and not the start of the function init_post. The mcount relocation is at the call site 0x185 from the start of the .text section. .text + 0x185 == init_post + 0xa We need a way to add this __mcount_loc section in a way that we do not lose the relocations after final link. The .text section here will be attached to all other .text sections after final link and the offsets will be meaningless. We need to keep track of where these .text sections are. To do this, we use the start of the first function in the section. do_one_initcall. We can make a tmp.s file with this function as a reference to the start of the .text section. .section __mcount_loc,"a",@progbits [...] .quad do_one_initcall + 0x185 [...] Then we can compile the tmp.s into a tmp.o gcc -c tmp.s -o tmp.o And link it into back into main.o. ld -r main.o tmp.o -o tmp_main.o mv tmp_main.o main.o But we have a problem. What happens if the first function in a section is not exported, and is a static function. The linker will not let the tmp.o use it. This case exists in main.o as well. Disassembly of section .init.text: 0000000000000000 <set_reset_devices>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: e8 00 00 00 00 callq 9 <set_reset_devices+0x9> 5: R_X86_64_PC32 mcount+0xfffffffffffffffc The first function in .init.text is a static function. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices The lowercase 't' means that set_reset_devices is local and is not exported. If we simply try to link the tmp.o with the set_reset_devices we end up with two symbols: one local and one global. .section __mcount_loc,"a",@progbits .quad set_reset_devices + 0x10 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices U set_reset_devices We still have an undefined reference to set_reset_devices, and if we try to compile the kernel, we will end up with an undefined reference to set_reset_devices, or even worst, it could be exported someplace else, and then we will have a reference to the wrong location. To handle this case, we make an intermediate step using objcopy. We convert set_reset_devices into a global exported symbol before linking it with tmp.o and set it back afterwards. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices Now we have a section in main.o called __mcount_loc that we can place somewhere in the kernel using vmlinux.ld.S and access it to convert all these locations that call mcount into nops before starting SMP and thus, eliminating the need to do this with kstop_machine. Note, A well documented perl script (scripts/recordmcount.pl) is used to do all this in one location. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14 19:45:07 +00:00
#ifdef CONFIG_FTRACE_MCOUNT_RECORD
module/ftrace: handle patchable-function-entry When using patchable-function-entry, the compiler will record the callsites into a section named "__patchable_function_entries" rather than "__mcount_loc". Let's abstract this difference behind a new FTRACE_CALLSITE_SECTION, so that architectures don't have to handle this explicitly (e.g. with custom module linker scripts). As parisc currently handles this explicitly, it is fixed up accordingly, with its custom linker script removed. Since FTRACE_CALLSITE_SECTION is only defined when DYNAMIC_FTRACE is selected, the parisc module loading code is updated to only use the definition in that case. When DYNAMIC_FTRACE is not selected, modules shouldn't have this section, so this removes some redundant work in that case. To make sure that this is keep up-to-date for modules and the main kernel, a comment is added to vmlinux.lds.h, with the existing ifdeffery simplified for legibility. I built parisc generic-{32,64}bit_defconfig with DYNAMIC_FTRACE enabled, and verified that the section made it into the .ko files for modules. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Helge Deller <deller@gmx.de> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Torsten Duwe <duwe@suse.de> Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Tested-by: Sven Schnelle <svens@stackframe.org> Tested-by: Torsten Duwe <duwe@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: linux-parisc@vger.kernel.org
2019-10-16 17:17:11 +00:00
/*
* The ftrace call sites are logged to a section whose name depends on the
* compiler option used. A given kernel image will only use one, AKA
* FTRACE_CALLSITE_SECTION. We capture all of them here to avoid header
* dependencies for FTRACE_CALLSITE_SECTION's definition.
New tracing features: - PERAMAENT flag to ftrace_ops when attaching a callback to a function As /proc/sys/kernel/ftrace_enabled when set to zero will disable all attached callbacks in ftrace, this has a detrimental impact on live kernel tracing, as it disables all that it patched. If a ftrace_ops is registered to ftrace with the PERMANENT flag set, it will prevent ftrace_enabled from being disabled, and if ftrace_enabled is already disabled, it will prevent a ftrace_ops with PREMANENT flag set from being registered. - New register_ftrace_direct(). As eBPF would like to register its own trampolines to be called by the ftrace nop locations directly, without going through the ftrace trampoline, this function has been added. This allows for eBPF trampolines to live along side of ftrace, perf, kprobe and live patching. It also utilizes the ftrace enabled_functions file that keeps track of functions that have been modified in the kernel, to allow for security auditing. - Allow for kernel internal use of ftrace instances. Subsystems in the kernel can now create and destroy their own tracing instances which allows them to have their own tracing buffer, and be able to record events without worrying about other users from writing over their data. - New seq_buf_hex_dump() that lets users use the hex_dump() in their seq_buf usage. - Notifications now added to tracing_max_latency to allow user space to know when a new max latency is hit by one of the latency tracers. - Wider spread use of generic compare operations for use of bsearch and friends. - More synthetic event fields may be defined (32 up from 16) - Use of xarray for architectures with sparse system calls, for the system call trace events. This along with small clean ups and fixes. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXdwv4BQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qnB5AP91vsdHQjwE1+/UWG/cO+qFtKvn2QJK QmBRIJNH/s+1TAD/fAOhgw+ojSK3o/qc+NpvPTEW9AEwcJL1wacJUn+XbQc= =ztql -----END PGP SIGNATURE----- Merge tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "New tracing features: - New PERMANENT flag to ftrace_ops when attaching a callback to a function. As /proc/sys/kernel/ftrace_enabled when set to zero will disable all attached callbacks in ftrace, this has a detrimental impact on live kernel tracing, as it disables all that it patched. If a ftrace_ops is registered to ftrace with the PERMANENT flag set, it will prevent ftrace_enabled from being disabled, and if ftrace_enabled is already disabled, it will prevent a ftrace_ops with PREMANENT flag set from being registered. - New register_ftrace_direct(). As eBPF would like to register its own trampolines to be called by the ftrace nop locations directly, without going through the ftrace trampoline, this function has been added. This allows for eBPF trampolines to live along side of ftrace, perf, kprobe and live patching. It also utilizes the ftrace enabled_functions file that keeps track of functions that have been modified in the kernel, to allow for security auditing. - Allow for kernel internal use of ftrace instances. Subsystems in the kernel can now create and destroy their own tracing instances which allows them to have their own tracing buffer, and be able to record events without worrying about other users from writing over their data. - New seq_buf_hex_dump() that lets users use the hex_dump() in their seq_buf usage. - Notifications now added to tracing_max_latency to allow user space to know when a new max latency is hit by one of the latency tracers. - Wider spread use of generic compare operations for use of bsearch and friends. - More synthetic event fields may be defined (32 up from 16) - Use of xarray for architectures with sparse system calls, for the system call trace events. This along with small clean ups and fixes" * tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (51 commits) tracing: Enable syscall optimization for MIPS tracing: Use xarray for syscall trace events tracing: Sample module to demonstrate kernel access to Ftrace instances. tracing: Adding new functions for kernel access to Ftrace instances tracing: Fix Kconfig indentation ring-buffer: Fix typos in function ring_buffer_producer ftrace: Use BIT() macro ftrace: Return ENOTSUPP when DYNAMIC_FTRACE_WITH_DIRECT_CALLS is not configured ftrace: Rename ftrace_graph_stub to ftrace_stub_graph ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization ftrace: Add helper find_direct_entry() to consolidate code ftrace: Add another check for match in register_ftrace_direct() ftrace: Fix accounting bug with direct->count in register_ftrace_direct() ftrace/selftests: Fix spelling mistake "wakeing" -> "waking" tracing: Increase SYNTH_FIELDS_MAX for synthetic_events ftrace/samples: Add a sample module that implements modify_ftrace_direct() ftrace: Add modify_ftrace_direct() tracing: Add missing "inline" in stub function of latency_fsnotify() tracing: Remove stray tab in TRACE_EVAL_MAP_FILE's help text tracing: Use seq_buf_hex_dump() to dump buffers ...
2019-11-27 19:42:01 +00:00
*
* ftrace_ops_list_func will be defined as arch_ftrace_ops_list_func
* as some archs will have a different prototype for that function
* but ftrace_ops_list_func() will have a single prototype.
module/ftrace: handle patchable-function-entry When using patchable-function-entry, the compiler will record the callsites into a section named "__patchable_function_entries" rather than "__mcount_loc". Let's abstract this difference behind a new FTRACE_CALLSITE_SECTION, so that architectures don't have to handle this explicitly (e.g. with custom module linker scripts). As parisc currently handles this explicitly, it is fixed up accordingly, with its custom linker script removed. Since FTRACE_CALLSITE_SECTION is only defined when DYNAMIC_FTRACE is selected, the parisc module loading code is updated to only use the definition in that case. When DYNAMIC_FTRACE is not selected, modules shouldn't have this section, so this removes some redundant work in that case. To make sure that this is keep up-to-date for modules and the main kernel, a comment is added to vmlinux.lds.h, with the existing ifdeffery simplified for legibility. I built parisc generic-{32,64}bit_defconfig with DYNAMIC_FTRACE enabled, and verified that the section made it into the .ko files for modules. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Helge Deller <deller@gmx.de> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Torsten Duwe <duwe@suse.de> Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Tested-by: Sven Schnelle <svens@stackframe.org> Tested-by: Torsten Duwe <duwe@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: linux-parisc@vger.kernel.org
2019-10-16 17:17:11 +00:00
*/
#define MCOUNT_REC() . = ALIGN(8); \
__start_mcount_loc = .; \
KEEP(*(__mcount_loc)) \
KEEP_PATCHABLE \
__stop_mcount_loc = .; \
FTRACE_STUB_HACK \
ftrace_ops_list_func = arch_ftrace_ops_list_func;
ftrace: create __mcount_loc section This patch creates a section in the kernel called "__mcount_loc". This will hold a list of pointers to the mcount relocation for each call site of mcount. For example: objdump -dr init/main.o [...] Disassembly of section .text: 0000000000000000 <do_one_initcall>: 0: 55 push %rbp [...] 000000000000017b <init_post>: 17b: 55 push %rbp 17c: 48 89 e5 mov %rsp,%rbp 17f: 53 push %rbx 180: 48 83 ec 08 sub $0x8,%rsp 184: e8 00 00 00 00 callq 189 <init_post+0xe> 185: R_X86_64_PC32 mcount+0xfffffffffffffffc [...] We will add a section to point to each function call. .section __mcount_loc,"a",@progbits [...] .quad .text + 0x185 [...] The offset to of the mcount call site in init_post is an offset from the start of the section, and not the start of the function init_post. The mcount relocation is at the call site 0x185 from the start of the .text section. .text + 0x185 == init_post + 0xa We need a way to add this __mcount_loc section in a way that we do not lose the relocations after final link. The .text section here will be attached to all other .text sections after final link and the offsets will be meaningless. We need to keep track of where these .text sections are. To do this, we use the start of the first function in the section. do_one_initcall. We can make a tmp.s file with this function as a reference to the start of the .text section. .section __mcount_loc,"a",@progbits [...] .quad do_one_initcall + 0x185 [...] Then we can compile the tmp.s into a tmp.o gcc -c tmp.s -o tmp.o And link it into back into main.o. ld -r main.o tmp.o -o tmp_main.o mv tmp_main.o main.o But we have a problem. What happens if the first function in a section is not exported, and is a static function. The linker will not let the tmp.o use it. This case exists in main.o as well. Disassembly of section .init.text: 0000000000000000 <set_reset_devices>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: e8 00 00 00 00 callq 9 <set_reset_devices+0x9> 5: R_X86_64_PC32 mcount+0xfffffffffffffffc The first function in .init.text is a static function. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices The lowercase 't' means that set_reset_devices is local and is not exported. If we simply try to link the tmp.o with the set_reset_devices we end up with two symbols: one local and one global. .section __mcount_loc,"a",@progbits .quad set_reset_devices + 0x10 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices U set_reset_devices We still have an undefined reference to set_reset_devices, and if we try to compile the kernel, we will end up with an undefined reference to set_reset_devices, or even worst, it could be exported someplace else, and then we will have a reference to the wrong location. To handle this case, we make an intermediate step using objcopy. We convert set_reset_devices into a global exported symbol before linking it with tmp.o and set it back afterwards. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices Now we have a section in main.o called __mcount_loc that we can place somewhere in the kernel using vmlinux.ld.S and access it to convert all these locations that call mcount into nops before starting SMP and thus, eliminating the need to do this with kstop_machine. Note, A well documented perl script (scripts/recordmcount.pl) is used to do all this in one location. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14 19:45:07 +00:00
#else
# ifdef CONFIG_FUNCTION_TRACER
# define MCOUNT_REC() FTRACE_STUB_HACK \
ftrace_ops_list_func = arch_ftrace_ops_list_func;
# else
# define MCOUNT_REC()
# endif
ftrace: create __mcount_loc section This patch creates a section in the kernel called "__mcount_loc". This will hold a list of pointers to the mcount relocation for each call site of mcount. For example: objdump -dr init/main.o [...] Disassembly of section .text: 0000000000000000 <do_one_initcall>: 0: 55 push %rbp [...] 000000000000017b <init_post>: 17b: 55 push %rbp 17c: 48 89 e5 mov %rsp,%rbp 17f: 53 push %rbx 180: 48 83 ec 08 sub $0x8,%rsp 184: e8 00 00 00 00 callq 189 <init_post+0xe> 185: R_X86_64_PC32 mcount+0xfffffffffffffffc [...] We will add a section to point to each function call. .section __mcount_loc,"a",@progbits [...] .quad .text + 0x185 [...] The offset to of the mcount call site in init_post is an offset from the start of the section, and not the start of the function init_post. The mcount relocation is at the call site 0x185 from the start of the .text section. .text + 0x185 == init_post + 0xa We need a way to add this __mcount_loc section in a way that we do not lose the relocations after final link. The .text section here will be attached to all other .text sections after final link and the offsets will be meaningless. We need to keep track of where these .text sections are. To do this, we use the start of the first function in the section. do_one_initcall. We can make a tmp.s file with this function as a reference to the start of the .text section. .section __mcount_loc,"a",@progbits [...] .quad do_one_initcall + 0x185 [...] Then we can compile the tmp.s into a tmp.o gcc -c tmp.s -o tmp.o And link it into back into main.o. ld -r main.o tmp.o -o tmp_main.o mv tmp_main.o main.o But we have a problem. What happens if the first function in a section is not exported, and is a static function. The linker will not let the tmp.o use it. This case exists in main.o as well. Disassembly of section .init.text: 0000000000000000 <set_reset_devices>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: e8 00 00 00 00 callq 9 <set_reset_devices+0x9> 5: R_X86_64_PC32 mcount+0xfffffffffffffffc The first function in .init.text is a static function. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices The lowercase 't' means that set_reset_devices is local and is not exported. If we simply try to link the tmp.o with the set_reset_devices we end up with two symbols: one local and one global. .section __mcount_loc,"a",@progbits .quad set_reset_devices + 0x10 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices U set_reset_devices We still have an undefined reference to set_reset_devices, and if we try to compile the kernel, we will end up with an undefined reference to set_reset_devices, or even worst, it could be exported someplace else, and then we will have a reference to the wrong location. To handle this case, we make an intermediate step using objcopy. We convert set_reset_devices into a global exported symbol before linking it with tmp.o and set it back afterwards. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices Now we have a section in main.o called __mcount_loc that we can place somewhere in the kernel using vmlinux.ld.S and access it to convert all these locations that call mcount into nops before starting SMP and thus, eliminating the need to do this with kstop_machine. Note, A well documented perl script (scripts/recordmcount.pl) is used to do all this in one location. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-08-14 19:45:07 +00:00
#endif
#define BOUNDED_SECTION_PRE_LABEL(_sec_, _label_, _BEGIN_, _END_) \
_BEGIN_##_label_ = .; \
KEEP(*(_sec_)) \
_END_##_label_ = .;
#define BOUNDED_SECTION_POST_LABEL(_sec_, _label_, _BEGIN_, _END_) \
_label_##_BEGIN_ = .; \
KEEP(*(_sec_)) \
_label_##_END_ = .;
#define BOUNDED_SECTION_BY(_sec_, _label_) \
BOUNDED_SECTION_PRE_LABEL(_sec_, _label_, __start, __stop)
#define BOUNDED_SECTION(_sec) BOUNDED_SECTION_BY(_sec, _sec)
vmlinux.lds.h: add HEADERED_SECTION_* macros These macros elaborate on BOUNDED_SECTION_(PRE|POST)_LABEL macros, prepending an optional KEEP(.gnu.linkonce##_sec_) reservation, and a linker-symbol to address it. This allows a developer to define a header struct (which must fit with the section's base struct-type), and could contain: 1- fields whose value is common to the entire set of data-records. This allows the header & data structs to specialize, complement each other, and shrink. 2- an uplink pointer to an organizing struct which refs other related/sub data-tables header record is addressable via the extern'd header linker-symbol Once the linker-symbols created by the macro are ref'd extern in code, that code can compute a record's index (ptr - start) in the "primary" table, then use it to index into the related/sub tables. Adding a primary.map_* field foreach sub-table would then allow deduplication and remapping of that sub-table. This is aimed at dyndbg's struct _ddebug __dyndbg[] section, whose 3 columns: function, file, module are 50%, 90%, 100% redundant. The module column is fully recoverable after dynamic_debug_init() saves it to each ddebug_table.module as the builtin __dyndbg[] table is parsed. Given that those 3 columns use 24/56 of a _ddebug record, a dyndbg=y kernel with ~5k callsites could reduce kernel memory substantially. Returning that memory to the kernel buddy-allocator? is then possible. Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Link: https://lore.kernel.org/r/20221117171633.923628-3-jim.cromie@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-17 17:16:33 +00:00
#define HEADERED_SECTION_PRE_LABEL(_sec_, _label_, _BEGIN_, _END_, _HDR_) \
_HDR_##_label_ = .; \
KEEP(*(.gnu.linkonce.##_sec_)) \
BOUNDED_SECTION_PRE_LABEL(_sec_, _label_, _BEGIN_, _END_)
#define HEADERED_SECTION_POST_LABEL(_sec_, _label_, _BEGIN_, _END_, _HDR_) \
_label_##_HDR_ = .; \
KEEP(*(.gnu.linkonce.##_sec_)) \
BOUNDED_SECTION_POST_LABEL(_sec_, _label_, _BEGIN_, _END_)
#define HEADERED_SECTION_BY(_sec_, _label_) \
HEADERED_SECTION_PRE_LABEL(_sec_, _label_, __start, __stop)
#define HEADERED_SECTION(_sec) HEADERED_SECTION_BY(_sec, _sec)
#ifdef CONFIG_TRACE_BRANCH_PROFILING
#define LIKELY_PROFILE() \
BOUNDED_SECTION_BY(_ftrace_annotated_branch, _annotated_branch_profile)
tracing: profile likely and unlikely annotations Impact: new unlikely/likely profiler Andrew Morton recently suggested having an in-kernel way to profile likely and unlikely macros. This patch achieves that goal. When configured, every(*) likely and unlikely macro gets a counter attached to it. When the condition is hit, the hit and misses of that condition are recorded. These numbers can later be retrieved by: /debugfs/tracing/profile_likely - All likely markers /debugfs/tracing/profile_unlikely - All unlikely markers. # cat /debug/tracing/profile_unlikely | head correct incorrect % Function File Line ------- --------- - -------- ---- ---- 2167 0 0 do_arch_prctl process_64.c 832 0 0 0 do_arch_prctl process_64.c 804 2670 0 0 IS_ERR err.h 34 71230 5693 7 __switch_to process_64.c 673 76919 0 0 __switch_to process_64.c 639 43184 33743 43 __switch_to process_64.c 624 12740 64181 83 __switch_to process_64.c 594 12740 64174 83 __switch_to process_64.c 590 # cat /debug/tracing/profile_unlikely | \ awk '{ if ($3 > 25) print $0; }' |head -20 44963 35259 43 __switch_to process_64.c 624 12762 67454 84 __switch_to process_64.c 594 12762 67447 84 __switch_to process_64.c 590 1478 595 28 syscall_get_error syscall.h 51 0 2821 100 syscall_trace_leave ptrace.c 1567 0 1 100 native_smp_prepare_cpus smpboot.c 1237 86338 265881 75 calc_delta_fair sched_fair.c 408 210410 108540 34 calc_delta_mine sched.c 1267 0 54550 100 sched_info_queued sched_stats.h 222 51899 66435 56 pick_next_task_fair sched_fair.c 1422 6 10 62 yield_task_fair sched_fair.c 982 7325 2692 26 rt_policy sched.c 144 0 1270 100 pre_schedule_rt sched_rt.c 1261 1268 48073 97 pick_next_task_rt sched_rt.c 884 0 45181 100 sched_info_dequeued sched_stats.h 177 0 15 100 sched_move_task sched.c 8700 0 15 100 sched_move_task sched.c 8690 53167 33217 38 schedule sched.c 4457 0 80208 100 sched_info_switch sched_stats.h 270 30585 49631 61 context_switch sched.c 2619 # cat /debug/tracing/profile_likely | awk '{ if ($3 > 25) print $0; }' 39900 36577 47 pick_next_task sched.c 4397 20824 15233 42 switch_mm mmu_context_64.h 18 0 7 100 __cancel_work_timer workqueue.c 560 617 66484 99 clocksource_adjust timekeeping.c 456 0 346340 100 audit_syscall_exit auditsc.c 1570 38 347350 99 audit_get_context auditsc.c 732 0 345244 100 audit_syscall_entry auditsc.c 1541 38 1017 96 audit_free auditsc.c 1446 0 1090 100 audit_alloc auditsc.c 862 2618 1090 29 audit_alloc auditsc.c 858 0 6 100 move_masked_irq migration.c 9 1 198 99 probe_sched_wakeup trace_sched_switch.c 58 2 2 50 probe_wakeup trace_sched_wakeup.c 227 0 2 100 probe_wakeup_sched_switch trace_sched_wakeup.c 144 4514 2090 31 __grab_cache_page filemap.c 2149 12882 228786 94 mapping_unevictable pagemap.h 50 4 11 73 __flush_cpu_slab slub.c 1466 627757 330451 34 slab_free slub.c 1731 2959 61245 95 dentry_lru_del_init dcache.c 153 946 1217 56 load_elf_binary binfmt_elf.c 904 102 82 44 disk_put_part genhd.h 206 1 1 50 dst_gc_task dst.c 82 0 19 100 tcp_mss_split_point tcp_output.c 1126 As you can see by the above, there's a bit of work to do in rethinking the use of some unlikelys and likelys. Note: the unlikely case had 71 hits that were more than 25%. Note: After submitting my first version of this patch, Andrew Morton showed me a version written by Daniel Walker, where I picked up the following ideas from: 1) Using __builtin_constant_p to avoid profiling fixed values. 2) Using __FILE__ instead of instruction pointers. 3) Using the preprocessor to stop all profiling of likely annotations from vsyscall_64.c. Thanks to Andrew Morton, Arjan van de Ven, Theodore Tso and Ingo Molnar for their feed back on this patch. (*) Not ever unlikely is recorded, those that are used by vsyscalls (a few of them) had to have profiling disabled. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Theodore Tso <tytso@mit.edu> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-12 05:14:39 +00:00
#else
#define LIKELY_PROFILE()
#endif
#ifdef CONFIG_PROFILE_ALL_BRANCHES
#define BRANCH_PROFILE() \
BOUNDED_SECTION_BY(_ftrace_branch, _branch_profile)
#else
#define BRANCH_PROFILE()
#endif
kprobes: Introduce NOKPROBE_SYMBOL() macro to maintain kprobes blacklist Introduce NOKPROBE_SYMBOL() macro which builds a kprobes blacklist at kernel build time. The usage of this macro is similar to EXPORT_SYMBOL(), placed after the function definition: NOKPROBE_SYMBOL(function); Since this macro will inhibit inlining of static/inline functions, this patch also introduces a nokprobe_inline macro for static/inline functions. In this case, we must use NOKPROBE_SYMBOL() for the inline function caller. When CONFIG_KPROBES=y, the macro stores the given function address in the "_kprobe_blacklist" section. Since the data structures are not fully initialized by the macro (because there is no "size" information), those are re-initialized at boot time by using kallsyms. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081705.26341.96719.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Alok Kataria <akataria@vmware.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christopher Li <sparse@chrisli.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: David S. Miller <davem@davemloft.net> Cc: Jan-Simon Möller <dl9pf@gmx.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-sparse@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17 08:17:05 +00:00
#ifdef CONFIG_KPROBES
#define KPROBE_BLACKLIST() \
. = ALIGN(8); \
BOUNDED_SECTION(_kprobe_blacklist)
kprobes: Introduce NOKPROBE_SYMBOL() macro to maintain kprobes blacklist Introduce NOKPROBE_SYMBOL() macro which builds a kprobes blacklist at kernel build time. The usage of this macro is similar to EXPORT_SYMBOL(), placed after the function definition: NOKPROBE_SYMBOL(function); Since this macro will inhibit inlining of static/inline functions, this patch also introduces a nokprobe_inline macro for static/inline functions. In this case, we must use NOKPROBE_SYMBOL() for the inline function caller. When CONFIG_KPROBES=y, the macro stores the given function address in the "_kprobe_blacklist" section. Since the data structures are not fully initialized by the macro (because there is no "size" information), those are re-initialized at boot time by using kallsyms. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081705.26341.96719.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Alok Kataria <akataria@vmware.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christopher Li <sparse@chrisli.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: David S. Miller <davem@davemloft.net> Cc: Jan-Simon Möller <dl9pf@gmx.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-sparse@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17 08:17:05 +00:00
#else
#define KPROBE_BLACKLIST()
#endif
#ifdef CONFIG_FUNCTION_ERROR_INJECTION
#define ERROR_INJECT_WHITELIST() \
STRUCT_ALIGN(); \
BOUNDED_SECTION(_error_injection_whitelist)
#else
#define ERROR_INJECT_WHITELIST()
#endif
#ifdef CONFIG_EVENT_TRACING
#define FTRACE_EVENTS() \
. = ALIGN(8); \
BOUNDED_SECTION(_ftrace_events) \
BOUNDED_SECTION_BY(_ftrace_eval_map, _ftrace_eval_maps)
#else
#define FTRACE_EVENTS()
#endif
#ifdef CONFIG_TRACING
#define TRACE_PRINTKS() BOUNDED_SECTION_BY(__trace_printk_fmt, ___trace_bprintk_fmt)
#define TRACEPOINT_STR() BOUNDED_SECTION_BY(__tracepoint_str, ___tracepoint_str)
#else
#define TRACE_PRINTKS()
tracing: Add __tracepoint_string() to export string pointers There are several tracepoints (mostly in RCU), that reference a string pointer and uses the print format of "%s" to display the string that exists in the kernel, instead of copying the actual string to the ring buffer (saves time and ring buffer space). But this has an issue with userspace tools that read the binary buffers that has the address of the string but has no access to what the string itself is. The end result is just output that looks like: rcu_dyntick: ffffffff818adeaa 1 0 rcu_dyntick: ffffffff818adeb5 0 140000000000000 rcu_dyntick: ffffffff818adeb5 0 140000000000000 rcu_utilization: ffffffff8184333b rcu_utilization: ffffffff8184333b The above is pretty useless when read by the userspace tools. Ideally we would want something that looks like this: rcu_dyntick: Start 1 0 rcu_dyntick: End 0 140000000000000 rcu_dyntick: Start 140000000000000 0 rcu_callback: rcu_preempt rhp=0xffff880037aff710 func=put_cred_rcu 0/4 rcu_callback: rcu_preempt rhp=0xffff880078961980 func=file_free_rcu 0/5 rcu_dyntick: End 0 1 The trace_printk() which also only stores the address of the string format instead of recording the string into the buffer itself, exports the mapping of kernel addresses to format strings via the printk_format file in the debugfs tracing directory. The tracepoint strings can use this same method and output the format to the same file and the userspace tools will be able to decipher the address without any modification. The tracepoint strings need its own section to save the strings because the trace_printk section will cause the trace_printk() buffers to be allocated if anything exists within the section. trace_printk() is only used for debugging and should never exist in the kernel, we can not use the trace_printk sections. Add a new tracepoint_str section that will also be examined by the output of the printk_format file. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-07-12 21:07:27 +00:00
#define TRACEPOINT_STR()
#endif
#ifdef CONFIG_FTRACE_SYSCALLS
#define TRACE_SYSCALLS() \
. = ALIGN(8); \
BOUNDED_SECTION_BY(__syscalls_metadata, _syscalls_metadata)
#else
#define TRACE_SYSCALLS()
#endif
bpf: introduce BPF_RAW_TRACEPOINT Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access kernel internal arguments of the tracepoints in their raw form. >From bpf program point of view the access to the arguments look like: struct bpf_raw_tracepoint_args { __u64 args[0]; }; int bpf_prog(struct bpf_raw_tracepoint_args *ctx) { // program can read args[N] where N depends on tracepoint // and statically verified at program load+attach time } kprobe+bpf infrastructure allows programs access function arguments. This feature allows programs access raw tracepoint arguments. Similar to proposed 'dynamic ftrace events' there are no abi guarantees to what the tracepoints arguments are and what their meaning is. The program needs to type cast args properly and use bpf_probe_read() helper to access struct fields when argument is a pointer. For every tracepoint __bpf_trace_##call function is prepared. In assembler it looks like: (gdb) disassemble __bpf_trace_xdp_exception Dump of assembler code for function __bpf_trace_xdp_exception: 0xffffffff81132080 <+0>: mov %ecx,%ecx 0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3> where TRACE_EVENT(xdp_exception, TP_PROTO(const struct net_device *dev, const struct bpf_prog *xdp, u32 act), The above assembler snippet is casting 32-bit 'act' field into 'u64' to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is. All of ~500 of __bpf_trace_*() functions are only 5-10 byte long and in total this approach adds 7k bytes to .text. This approach gives the lowest possible overhead while calling trace_xdp_exception() from kernel C code and transitioning into bpf land. Since tracepoint+bpf are used at speeds of 1M+ events per second this is valuable optimization. The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced that returns anon_inode FD of 'bpf-raw-tracepoint' object. The user space looks like: // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type prog_fd = bpf_prog_load(...); // receive anon_inode fd for given bpf_raw_tracepoint with prog attached raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd); Ctrl-C of tracing daemon or cmdline tool that uses this feature will automatically detach bpf program, unload it and unregister tracepoint probe. On the kernel side the __bpf_raw_tp_map section of pointers to tracepoint definition and to __bpf_trace_*() probe function is used to find a tracepoint with "xdp_exception" name and corresponding __bpf_trace_xdp_exception() probe function which are passed to tracepoint_probe_register() to connect probe with tracepoint. Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf tracepoint mechanisms. perf_event_open() can be used in parallel on the same tracepoint. Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted. Each with its own bpf program. The kernel will execute all tracepoint probes and all attached bpf programs. In the future bpf_raw_tracepoints can be extended with query/introspection logic. __bpf_raw_tp_map section logic was contributed by Steven Rostedt Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 19:05:37 +00:00
#ifdef CONFIG_BPF_EVENTS
#define BPF_RAW_TP() STRUCT_ALIGN(); \
BOUNDED_SECTION_BY(__bpf_raw_tp_map, __bpf_raw_tp)
bpf: introduce BPF_RAW_TRACEPOINT Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access kernel internal arguments of the tracepoints in their raw form. >From bpf program point of view the access to the arguments look like: struct bpf_raw_tracepoint_args { __u64 args[0]; }; int bpf_prog(struct bpf_raw_tracepoint_args *ctx) { // program can read args[N] where N depends on tracepoint // and statically verified at program load+attach time } kprobe+bpf infrastructure allows programs access function arguments. This feature allows programs access raw tracepoint arguments. Similar to proposed 'dynamic ftrace events' there are no abi guarantees to what the tracepoints arguments are and what their meaning is. The program needs to type cast args properly and use bpf_probe_read() helper to access struct fields when argument is a pointer. For every tracepoint __bpf_trace_##call function is prepared. In assembler it looks like: (gdb) disassemble __bpf_trace_xdp_exception Dump of assembler code for function __bpf_trace_xdp_exception: 0xffffffff81132080 <+0>: mov %ecx,%ecx 0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3> where TRACE_EVENT(xdp_exception, TP_PROTO(const struct net_device *dev, const struct bpf_prog *xdp, u32 act), The above assembler snippet is casting 32-bit 'act' field into 'u64' to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is. All of ~500 of __bpf_trace_*() functions are only 5-10 byte long and in total this approach adds 7k bytes to .text. This approach gives the lowest possible overhead while calling trace_xdp_exception() from kernel C code and transitioning into bpf land. Since tracepoint+bpf are used at speeds of 1M+ events per second this is valuable optimization. The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced that returns anon_inode FD of 'bpf-raw-tracepoint' object. The user space looks like: // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type prog_fd = bpf_prog_load(...); // receive anon_inode fd for given bpf_raw_tracepoint with prog attached raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd); Ctrl-C of tracing daemon or cmdline tool that uses this feature will automatically detach bpf program, unload it and unregister tracepoint probe. On the kernel side the __bpf_raw_tp_map section of pointers to tracepoint definition and to __bpf_trace_*() probe function is used to find a tracepoint with "xdp_exception" name and corresponding __bpf_trace_xdp_exception() probe function which are passed to tracepoint_probe_register() to connect probe with tracepoint. Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf tracepoint mechanisms. perf_event_open() can be used in parallel on the same tracepoint. Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted. Each with its own bpf program. The kernel will execute all tracepoint probes and all attached bpf programs. In the future bpf_raw_tracepoints can be extended with query/introspection logic. __bpf_raw_tp_map section logic was contributed by Steven Rostedt Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 19:05:37 +00:00
#else
#define BPF_RAW_TP()
#endif
#ifdef CONFIG_SERIAL_EARLYCON
#define EARLYCON_TABLE() \
. = ALIGN(8); \
BOUNDED_SECTION_POST_LABEL(__earlycon_table, __earlycon_table, , _end)
#else
#define EARLYCON_TABLE()
#endif
#ifdef CONFIG_SECURITY
#define LSM_TABLE() \
. = ALIGN(8); \
BOUNDED_SECTION_PRE_LABEL(.lsm_info.init, _lsm_info, __start, __end)
#define EARLY_LSM_TABLE() \
. = ALIGN(8); \
BOUNDED_SECTION_PRE_LABEL(.early_lsm_info.init, _early_lsm_info, __start, __end)
#else
#define LSM_TABLE()
#define EARLY_LSM_TABLE()
#endif
#define ___OF_TABLE(cfg, name) _OF_TABLE_##cfg(name)
#define __OF_TABLE(cfg, name) ___OF_TABLE(cfg, name)
#define OF_TABLE(cfg, name) __OF_TABLE(IS_ENABLED(cfg), name)
#define _OF_TABLE_0(name)
#define _OF_TABLE_1(name) \
irqchip: add basic infrastructure With the recent creation of the drivers/irqchip/ directory, it is desirable to move irq controller drivers here. At the moment, the only driver here is irq-bcm2835, the driver for the irq controller found in the ARM BCM2835 SoC, present in Rasberry Pi systems. This irq controller driver was exporting its initialization function and its irq handling function through a header file in <linux/irqchip/bcm2835.h>. When proposing to also move another irq controller driver in drivers/irqchip, Rob Herring raised the very valid point that moving things to drivers/irqchip was good in order to remove more stuff from arch/arm, but if it means adding gazillions of headers files in include/linux/irqchip/, it would not be very nice. So, upon the suggestion of Rob Herring and Arnd Bergmann, this commit introduces a small infrastructure that defines a central irqchip_init() function in drivers/irqchip/irqchip.c, which is meant to be called as the ->init_irq() callback of ARM platforms. This function calls of_irq_init() with an array of match strings and init functions generated from a special linker section. Note that the irq controller driver initialization function is responsible for setting the global handle_arch_irq() variable, so that ARM platforms no longer have to define the ->handle_irq field in their DT_MACHINE structure. A global header, <linux/irqchip.h> is also added to expose the single irqchip_init() function to the reset of the kernel. A further commit moves the BCM2835 irq controller driver to this new small infrastructure, therefore removing the include/linux/irqchip/ directory. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Reviewed-by: Stephen Warren <swarren@wwwdotorg.org> Reviewed-by: Rob Herring <rob.herring@calxeda.com> Acked-by: Arnd Bergmann <arnd@arndb.de> [rob.herring: reword commit message to reflect use of linker sections.] Signed-off-by: Rob Herring <rob.herring@calxeda.com>
2012-11-20 22:00:52 +00:00
. = ALIGN(8); \
__##name##_of_table = .; \
KEEP(*(__##name##_of_table)) \
KEEP(*(__##name##_of_table_end))
#define TIMER_OF_TABLES() OF_TABLE(CONFIG_TIMER_OF, timer)
#define IRQCHIP_OF_MATCH_TABLE() OF_TABLE(CONFIG_IRQCHIP, irqchip)
#define CLK_OF_TABLES() OF_TABLE(CONFIG_COMMON_CLK, clk)
#define RESERVEDMEM_OF_TABLES() OF_TABLE(CONFIG_OF_RESERVED_MEM, reservedmem)
#define CPU_METHOD_OF_TABLES() OF_TABLE(CONFIG_SMP, cpu_method)
#define CPUIDLE_METHOD_OF_TABLES() OF_TABLE(CONFIG_CPU_IDLE, cpuidle_method)
#ifdef CONFIG_ACPI
#define ACPI_PROBE_TABLE(name) \
. = ALIGN(8); \
BOUNDED_SECTION_POST_LABEL(__##name##_acpi_probe_table, \
__##name##_acpi_probe_table,, _end)
#else
#define ACPI_PROBE_TABLE(name)
#endif
#ifdef CONFIG_THERMAL
#define THERMAL_TABLE(name) \
. = ALIGN(8); \
BOUNDED_SECTION_POST_LABEL(__##name##_thermal_table, \
__##name##_thermal_table,, _end)
#else
#define THERMAL_TABLE(name)
#endif
#define KERNEL_DTB() \
STRUCT_ALIGN(); \
__dtb_start = .; \
KEEP(*(.dtb.init.rodata)) \
__dtb_end = .;
/*
* .data section
*/
#define DATA_DATA \
mtd: only use __xipram annotation when XIP_KERNEL is set When XIP_KERNEL is enabled, some functions are defined in the .data ELF section because we require them to be in RAM whenever we communicate with the flash chip. However this causes problems when FTRACE is enabled and gcc emits calls to __gnu_mcount_nc in the function prolog: drivers/built-in.o: In function `cfi_chip_setup': :(.data+0x272fc): relocation truncated to fit: R_ARM_CALL against symbol `__gnu_mcount_nc' defined in .text section in arch/arm/kernel/built-in.o drivers/built-in.o: In function `cfi_probe_chip': :(.data+0x27de8): relocation truncated to fit: R_ARM_CALL against symbol `__gnu_mcount_nc' defined in .text section in arch/arm/kernel/built-in.o /tmp/ccY172rP.s: Assembler messages: /tmp/ccY172rP.s:70: Warning: ignoring changed section attributes for .data /tmp/ccY172rP.s: Error: 1 warning, treating warnings as errors make[5]: *** [drivers/mtd/chips/cfi_probe.o] Error 1 /tmp/ccK4rjeO.s: Assembler messages: /tmp/ccK4rjeO.s:421: Warning: ignoring changed section attributes for .data /tmp/ccK4rjeO.s: Error: 1 warning, treating warnings as errors make[5]: *** [drivers/mtd/chips/cfi_util.o] Error 1 /tmp/ccUvhCYR.s: Assembler messages: /tmp/ccUvhCYR.s:1895: Warning: ignoring changed section attributes for .data /tmp/ccUvhCYR.s: Error: 1 warning, treating warnings as errors Specifically, this does not work because the .data section is not marked executable, which leads LD to not generate trampolines for long calls. This moves the __xipram functions into their own .xiptext section instead. The section is still placed next to .data and located in RAM but is marked executable, which avoids the build errors. Also, we only need to place the XIP functions into a separate section if both CONFIG_XIP_KERNEL and CONFIG_MTD_XIP are set: When only MTD_XIP is used, the whole kernel is still in RAM and we do not need to worry about pulling out the rug under it. When only XIP_KERNEL but not MTD_XIP is set, the kernel is in some form of ROM, but we never write to it. Note that MTD_XIP has been broken on ARM since around 2011 or 2012. I have sent another patch[2] to fix compilation, which I plan to merge through arm-soc unless there are objections. The obvious alternative to that would be to completely rip out the MTD_XIP support from the kernel, since obviously nobody has been using it in a long while. Link: [1] https://patchwork.kernel.org/patch/8109771/ Link: [2] https://patchwork.kernel.org/patch/9855225/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2017-07-21 20:26:25 +00:00
*(.xiptext) \
*(DATA_MAIN) \
*(.data..decrypted) \
Introduce new section reference annotations tags: __ref, __refdata, __refconst Today we have the following annotations for functions/data referencing __init/__exit functions / data: __init_refok => for init functions __initdata_refok => for init data __exit_refok => for exit functions There is really no difference between the __init and __exit versions and simplify it and to introduce a shorter annotation the following new annotations are introduced: __ref => for functions (code) that references __*init / __*exit __refdata => for variables __refconst => for const variables Whit this annotation is it more obvious what the annotation is for and there is no longer the arbitary division between __init and __exit code. The mechanishm is the same as before - a special section is created which is made part of the usual sections in the linker script. We will start to see annotations like this: -static struct pci_serial_quirk pci_serial_quirks[] = { +static const struct pci_serial_quirk pci_serial_quirks[] __refconst = { ----------------- -static struct notifier_block __cpuinitdata cpuid_class_cpu_notifier = +static struct notifier_block cpuid_class_cpu_notifier __refdata = ---------------- -static int threshold_cpu_callback(struct notifier_block *nfb, +static int __ref threshold_cpu_callback(struct notifier_block *nfb, [The above is just random samples]. Note: No modifications were needed in modpost to support the new sections due to the newly introduced blacklisting. Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2008-01-28 19:21:15 +00:00
*(.ref.data) \
vmlinux.lds.h: gather .data..shared_aligned sections in DATA_DATA With the recent change "net: remove time limit in process_backlog()", the softnet_data variable changed from "DEFINE_PER_CPU()" to "DEFINE_PER_CPU_ALIGNED()" which moved it from the .data section to the .data.shared_align section. I'm not saying this patch is wrong, just that is what caused me to notice this larger problem. No one else in the kernel is using this aligned macro variant, so I imagine that's why no one has noticed yet. Since .data..shared_align isn't declared in any vmlinux files that I can see, the linker just places it last. This "just works" for most people, but when building a ROM kernel on Blackfin systems, it causes section overlap errors: bfin-uclinux-ld.real: section .init.data [00000000202e06b8 -> 00000000202e48b7] overlaps section .data.shared_aligned [00000000202e06b8 -> 00000000202e0723] I imagine other arches which support the ROM config option and thus do funky placement would see similar issues ... On x86, it is stuck in a dedicated section at the end: [8] .data PROGBITS ffffffff810ec000 2ec0000303a8 00 WA 0 0 4096 [9] .data.shared_alig PROGBITS ffffffff8111c3c0 31c3c00000c8 00 WA 0 0 64 So make sure we include this section in the DATA_DATA macro so that it is placed in the right location. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Cc: Greg Ungerer <gerg@snapgear.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 21:22:29 +00:00
*(.data..shared_aligned) /* percpu related */ \
*(.data..unlikely) \
__start_once = .; \
*(.data..once) \
__end_once = .; \
tracepoints: Fix section alignment using pointer array Make the tracepoints more robust, making them solid enough to handle compiler changes by not relying on anything based on compiler-specific behavior with respect to structure alignment. Implement an approach proposed by David Miller: use an array of const pointers to refer to the individual structures, and export this pointer array through the linker script rather than the structures per se. It will consume 32 extra bytes per tracepoint (24 for structure padding and 8 for the pointers), but are less likely to break due to compiler changes. History: commit 7e066fb8 tracepoints: add DECLARE_TRACE() and DEFINE_TRACE() added the aligned(32) type and variable attribute to the tracepoint structures to deal with gcc happily aligning statically defined structures on 32-byte multiples. One attempt was to use a 8-byte alignment for tracepoint structures by applying both the variable and type attribute to tracepoint structures definitions and declarations. It worked fine with gcc 4.5.1, but broke with gcc 4.4.4 and 4.4.5. The reason is that the "aligned" attribute only specify the _minimum_ alignment for a structure, leaving both the compiler and the linker free to align on larger multiples. Because tracepoint.c expects the structures to be placed as an array within each section, up-alignment cause NULL-pointer exceptions due to the extra unexpected padding. (this patch applies on top of -tip) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: David S. Miller <davem@davemloft.net> LKML-Reference: <20110126222622.GA10794@Krystal> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: Andrew Morton <akpm@linux-foundation.org> CC: Peter Zijlstra <peterz@infradead.org> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-01-26 22:26:22 +00:00
STRUCT_ALIGN(); \
tracing: Kernel Tracepoints Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers. Allows complete typing verification by declaring both tracing statement inline functions and probe registration/unregistration static inline functions within the same macro "DEFINE_TRACE". No format string is required. See the tracepoint Documentation and Samples patches for usage examples. Taken from the documentation patch : "A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or "off" (no probe is attached). When a tracepoint is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is "on", the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site). You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, which prototypes are described in a tracepoint declaration placed in a header file." Addition and removal of tracepoints is synchronized by RCU using the scheduler (and preempt_disable) as guarantees to find a quiescent state (this is really RCU "classic"). The update side uses rcu_barrier_sched() with call_rcu_sched() and the read/execute side uses "preempt_disable()/preempt_enable()". We make sure the previous array containing probes, which has been scheduled for deletion by the rcu callback, is indeed freed before we proceed to the next update. It therefore limits the rate of modification of a single tracepoint to one update per RCU period. The objective here is to permit fast batch add/removal of probes on _different_ tracepoints. Changelog : - Use #name ":" #proto as string to identify the tracepoint in the tracepoint table. This will make sure not type mismatch happens due to connexion of a probe with the wrong type to a tracepoint declared with the same name in a different header. - Add tracepoint_entry_free_old. - Change __TO_TRACE to get rid of the 'i' iterator. Masami Hiramatsu <mhiramat@redhat.com> : Tested on x86-64. Performance impact of a tracepoint : same as markers, except that it adds about 70 bytes of instructions in an unlikely branch of each instrumented function (the for loop, the stack setup and the function call). It currently adds a memory read, a test and a conditional branch at the instrumentation site (in the hot path). Immediate values will eventually change this into a load immediate, test and branch, which removes the memory read which will make the i-cache impact smaller (changing the memory read for a load immediate removes 3-4 bytes per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also saves the d-cache hit). About the performance impact of tracepoints (which is comparable to markers), even without immediate values optimizations, tests done by Hideo Aoki on ia64 show no regression. His test case was using hackbench on a kernel where scheduler instrumentation (about 5 events in code scheduler code) was added. Quoting Hideo Aoki about Markers : I evaluated overhead of kernel marker using linux-2.6-sched-fixes git tree, which includes several markers for LTTng, using an ia64 server. While the immediate trace mark feature isn't implemented on ia64, there is no major performance regression. So, I think that we don't have any issues to propose merging marker point patches into Linus's tree from the viewpoint of performance impact. I prepared two kernels to evaluate. The first one was compiled without CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS. I downloaded the original hackbench from the following URL: http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c I ran hackbench 5 times in each condition and calculated the average and difference between the kernels. The parameter of hackbench: every 50 from 50 to 800 The number of CPUs of the server: 2, 4, and 8 Below is the results. As you can see, major performance regression wasn't found in any case. Even if number of processes increases, differences between marker-enabled kernel and marker- disabled kernel doesn't increase. Moreover, if number of CPUs increases, the differences doesn't increase either. Curiously, marker-enabled kernel is better than marker-disabled kernel in more than half cases, although I guess it comes from the difference of memory access pattern. * 2 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 4.811 | 4.872 | +0.061 | +1.27 | 100 | 9.854 | 10.309 | +0.454 | +4.61 | 150 | 15.602 | 15.040 | -0.562 | -3.6 | 200 | 20.489 | 20.380 | -0.109 | -0.53 | 250 | 25.798 | 25.652 | -0.146 | -0.56 | 300 | 31.260 | 30.797 | -0.463 | -1.48 | 350 | 36.121 | 35.770 | -0.351 | -0.97 | 400 | 42.288 | 42.102 | -0.186 | -0.44 | 450 | 47.778 | 47.253 | -0.526 | -1.1 | 500 | 51.953 | 52.278 | +0.325 | +0.63 | 550 | 58.401 | 57.700 | -0.701 | -1.2 | 600 | 63.334 | 63.222 | -0.112 | -0.18 | 650 | 68.816 | 68.511 | -0.306 | -0.44 | 700 | 74.667 | 74.088 | -0.579 | -0.78 | 750 | 78.612 | 79.582 | +0.970 | +1.23 | 800 | 85.431 | 85.263 | -0.168 | -0.2 | -------------------------------------------------------------- * 4 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.586 | 2.584 | -0.003 | -0.1 | 100 | 5.254 | 5.283 | +0.030 | +0.56 | 150 | 8.012 | 8.074 | +0.061 | +0.76 | 200 | 11.172 | 11.000 | -0.172 | -1.54 | 250 | 13.917 | 14.036 | +0.119 | +0.86 | 300 | 16.905 | 16.543 | -0.362 | -2.14 | 350 | 19.901 | 20.036 | +0.135 | +0.68 | 400 | 22.908 | 23.094 | +0.186 | +0.81 | 450 | 26.273 | 26.101 | -0.172 | -0.66 | 500 | 29.554 | 29.092 | -0.461 | -1.56 | 550 | 32.377 | 32.274 | -0.103 | -0.32 | 600 | 35.855 | 35.322 | -0.533 | -1.49 | 650 | 39.192 | 38.388 | -0.804 | -2.05 | 700 | 41.744 | 41.719 | -0.025 | -0.06 | 750 | 45.016 | 44.496 | -0.520 | -1.16 | 800 | 48.212 | 47.603 | -0.609 | -1.26 | -------------------------------------------------------------- * 8 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.094 | 2.072 | -0.022 | -1.07 | 100 | 4.162 | 4.273 | +0.111 | +2.66 | 150 | 6.485 | 6.540 | +0.055 | +0.84 | 200 | 8.556 | 8.478 | -0.078 | -0.91 | 250 | 10.458 | 10.258 | -0.200 | -1.91 | 300 | 12.425 | 12.750 | +0.325 | +2.62 | 350 | 14.807 | 14.839 | +0.032 | +0.22 | 400 | 16.801 | 16.959 | +0.158 | +0.94 | 450 | 19.478 | 19.009 | -0.470 | -2.41 | 500 | 21.296 | 21.504 | +0.208 | +0.98 | 550 | 23.842 | 23.979 | +0.137 | +0.57 | 600 | 26.309 | 26.111 | -0.198 | -0.75 | 650 | 28.705 | 28.446 | -0.259 | -0.9 | 700 | 31.233 | 31.394 | +0.161 | +0.52 | 750 | 34.064 | 33.720 | -0.344 | -1.01 | 800 | 36.320 | 36.114 | -0.206 | -0.57 | -------------------------------------------------------------- Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: 'Peter Zijlstra' <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 16:16:16 +00:00
*(__tracepoints) \
/* implement dynamic printk debug */ \
. = ALIGN(8); \
BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
lib: add allocation tagging support for memory allocation profiling Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily instrument memory allocators. It registers an "alloc_tags" codetag type with /proc/allocinfo interface to output allocation tag information when the feature is enabled. CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory allocation profiling instrumentation. Memory allocation profiling can be enabled or disabled at runtime using /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation profiling by default. [surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title] Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com [surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU] Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com [klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h] Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com [surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()] Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Klara Modin <klarasmodin@gmail.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-21 16:36:35 +00:00
CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
tracing: Add __tracepoint_string() to export string pointers There are several tracepoints (mostly in RCU), that reference a string pointer and uses the print format of "%s" to display the string that exists in the kernel, instead of copying the actual string to the ring buffer (saves time and ring buffer space). But this has an issue with userspace tools that read the binary buffers that has the address of the string but has no access to what the string itself is. The end result is just output that looks like: rcu_dyntick: ffffffff818adeaa 1 0 rcu_dyntick: ffffffff818adeb5 0 140000000000000 rcu_dyntick: ffffffff818adeb5 0 140000000000000 rcu_utilization: ffffffff8184333b rcu_utilization: ffffffff8184333b The above is pretty useless when read by the userspace tools. Ideally we would want something that looks like this: rcu_dyntick: Start 1 0 rcu_dyntick: End 0 140000000000000 rcu_dyntick: Start 140000000000000 0 rcu_callback: rcu_preempt rhp=0xffff880037aff710 func=put_cred_rcu 0/4 rcu_callback: rcu_preempt rhp=0xffff880078961980 func=file_free_rcu 0/5 rcu_dyntick: End 0 1 The trace_printk() which also only stores the address of the string format instead of recording the string into the buffer itself, exports the mapping of kernel addresses to format strings via the printk_format file in the debugfs tracing directory. The tracepoint strings can use this same method and output the format to the same file and the userspace tools will be able to decipher the address without any modification. The tracepoint strings need its own section to save the strings because the trace_printk section will cause the trace_printk() buffers to be allocated if anything exists within the section. trace_printk() is only used for debugging and should never exist in the kernel, we can not use the trace_printk sections. Add a new tracepoint_str section that will also be examined by the output of the printk_format file. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-07-12 21:07:27 +00:00
TRACE_PRINTKS() \
bpf: introduce BPF_RAW_TRACEPOINT Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access kernel internal arguments of the tracepoints in their raw form. >From bpf program point of view the access to the arguments look like: struct bpf_raw_tracepoint_args { __u64 args[0]; }; int bpf_prog(struct bpf_raw_tracepoint_args *ctx) { // program can read args[N] where N depends on tracepoint // and statically verified at program load+attach time } kprobe+bpf infrastructure allows programs access function arguments. This feature allows programs access raw tracepoint arguments. Similar to proposed 'dynamic ftrace events' there are no abi guarantees to what the tracepoints arguments are and what their meaning is. The program needs to type cast args properly and use bpf_probe_read() helper to access struct fields when argument is a pointer. For every tracepoint __bpf_trace_##call function is prepared. In assembler it looks like: (gdb) disassemble __bpf_trace_xdp_exception Dump of assembler code for function __bpf_trace_xdp_exception: 0xffffffff81132080 <+0>: mov %ecx,%ecx 0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3> where TRACE_EVENT(xdp_exception, TP_PROTO(const struct net_device *dev, const struct bpf_prog *xdp, u32 act), The above assembler snippet is casting 32-bit 'act' field into 'u64' to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is. All of ~500 of __bpf_trace_*() functions are only 5-10 byte long and in total this approach adds 7k bytes to .text. This approach gives the lowest possible overhead while calling trace_xdp_exception() from kernel C code and transitioning into bpf land. Since tracepoint+bpf are used at speeds of 1M+ events per second this is valuable optimization. The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced that returns anon_inode FD of 'bpf-raw-tracepoint' object. The user space looks like: // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type prog_fd = bpf_prog_load(...); // receive anon_inode fd for given bpf_raw_tracepoint with prog attached raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd); Ctrl-C of tracing daemon or cmdline tool that uses this feature will automatically detach bpf program, unload it and unregister tracepoint probe. On the kernel side the __bpf_raw_tp_map section of pointers to tracepoint definition and to __bpf_trace_*() probe function is used to find a tracepoint with "xdp_exception" name and corresponding __bpf_trace_xdp_exception() probe function which are passed to tracepoint_probe_register() to connect probe with tracepoint. Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf tracepoint mechanisms. perf_event_open() can be used in parallel on the same tracepoint. Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted. Each with its own bpf program. The kernel will execute all tracepoint probes and all attached bpf programs. In the future bpf_raw_tracepoints can be extended with query/introspection logic. __bpf_raw_tp_map section logic was contributed by Steven Rostedt Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 19:05:37 +00:00
BPF_RAW_TP() \
TRACEPOINT_STR() \
KUNIT_TABLE()
/*
* Data section helpers
*/
#define NOSAVE_DATA \
. = ALIGN(PAGE_SIZE); \
__nosave_begin = .; \
*(.data..nosave) \
. = ALIGN(PAGE_SIZE); \
__nosave_end = .;
#define PAGE_ALIGNED_DATA(page_align) \
. = ALIGN(page_align); \
*(.data..page_aligned) \
. = ALIGN(page_align);
#define READ_MOSTLY_DATA(align) \
. = ALIGN(align); \
*(.data..read_mostly) \
. = ALIGN(align);
#define CACHELINE_ALIGNED_DATA(align) \
. = ALIGN(align); \
*(.data..cacheline_aligned)
#define INIT_TASK_DATA(align) \
. = ALIGN(align); \
__start_init_stack = .; \
init_thread_union = .; \
init_stack = .; \
KEEP(*(.data..init_task)) \
KEEP(*(.data..init_thread_info)) \
. = __start_init_stack + THREAD_SIZE; \
__end_init_stack = .;
#define JUMP_TABLE_DATA \
. = ALIGN(8); \
BOUNDED_SECTION_BY(__jump_table, ___jump_table)
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
#define STATIC_CALL_DATA \
. = ALIGN(8); \
BOUNDED_SECTION_BY(.static_call_sites, _static_call_sites) \
BOUNDED_SECTION_BY(.static_call_tramp_key, _static_call_tramp_key)
#else
#define STATIC_CALL_DATA
#endif
/*
* Allow architectures to handle ro_after_init data on their
* own by defining an empty RO_AFTER_INIT_DATA.
*/
#ifndef RO_AFTER_INIT_DATA
#define RO_AFTER_INIT_DATA \
. = ALIGN(8); \
__start_ro_after_init = .; \
*(.data..ro_after_init) \
JUMP_TABLE_DATA \
STATIC_CALL_DATA \
__end_ro_after_init = .;
#endif
/*
* .kcfi_traps contains a list KCFI trap locations.
*/
#ifndef KCFI_TRAPS
#ifdef CONFIG_ARCH_USES_CFI_TRAPS
#define KCFI_TRAPS \
__kcfi_traps : AT(ADDR(__kcfi_traps) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(.kcfi_traps, ___kcfi_traps) \
}
#else
#define KCFI_TRAPS
#endif
#endif
/*
* Read only Data
*/
#define RO_DATA(align) \
. = ALIGN((align)); \
.rodata : AT(ADDR(.rodata) - LOAD_OFFSET) { \
__start_rodata = .; \
*(.rodata) *(.rodata.*) \
SCHED_DATA \
RO_AFTER_INIT_DATA /* Read only after init */ \
tracepoints: Fix section alignment using pointer array Make the tracepoints more robust, making them solid enough to handle compiler changes by not relying on anything based on compiler-specific behavior with respect to structure alignment. Implement an approach proposed by David Miller: use an array of const pointers to refer to the individual structures, and export this pointer array through the linker script rather than the structures per se. It will consume 32 extra bytes per tracepoint (24 for structure padding and 8 for the pointers), but are less likely to break due to compiler changes. History: commit 7e066fb8 tracepoints: add DECLARE_TRACE() and DEFINE_TRACE() added the aligned(32) type and variable attribute to the tracepoint structures to deal with gcc happily aligning statically defined structures on 32-byte multiples. One attempt was to use a 8-byte alignment for tracepoint structures by applying both the variable and type attribute to tracepoint structures definitions and declarations. It worked fine with gcc 4.5.1, but broke with gcc 4.4.4 and 4.4.5. The reason is that the "aligned" attribute only specify the _minimum_ alignment for a structure, leaving both the compiler and the linker free to align on larger multiples. Because tracepoint.c expects the structures to be placed as an array within each section, up-alignment cause NULL-pointer exceptions due to the extra unexpected padding. (this patch applies on top of -tip) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: David S. Miller <davem@davemloft.net> LKML-Reference: <20110126222622.GA10794@Krystal> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: Andrew Morton <akpm@linux-foundation.org> CC: Peter Zijlstra <peterz@infradead.org> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-01-26 22:26:22 +00:00
. = ALIGN(8); \
BOUNDED_SECTION_BY(__tracepoints_ptrs, ___tracepoints_ptrs) \
tracing: Kernel Tracepoints Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers. Allows complete typing verification by declaring both tracing statement inline functions and probe registration/unregistration static inline functions within the same macro "DEFINE_TRACE". No format string is required. See the tracepoint Documentation and Samples patches for usage examples. Taken from the documentation patch : "A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or "off" (no probe is attached). When a tracepoint is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is "on", the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site). You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, which prototypes are described in a tracepoint declaration placed in a header file." Addition and removal of tracepoints is synchronized by RCU using the scheduler (and preempt_disable) as guarantees to find a quiescent state (this is really RCU "classic"). The update side uses rcu_barrier_sched() with call_rcu_sched() and the read/execute side uses "preempt_disable()/preempt_enable()". We make sure the previous array containing probes, which has been scheduled for deletion by the rcu callback, is indeed freed before we proceed to the next update. It therefore limits the rate of modification of a single tracepoint to one update per RCU period. The objective here is to permit fast batch add/removal of probes on _different_ tracepoints. Changelog : - Use #name ":" #proto as string to identify the tracepoint in the tracepoint table. This will make sure not type mismatch happens due to connexion of a probe with the wrong type to a tracepoint declared with the same name in a different header. - Add tracepoint_entry_free_old. - Change __TO_TRACE to get rid of the 'i' iterator. Masami Hiramatsu <mhiramat@redhat.com> : Tested on x86-64. Performance impact of a tracepoint : same as markers, except that it adds about 70 bytes of instructions in an unlikely branch of each instrumented function (the for loop, the stack setup and the function call). It currently adds a memory read, a test and a conditional branch at the instrumentation site (in the hot path). Immediate values will eventually change this into a load immediate, test and branch, which removes the memory read which will make the i-cache impact smaller (changing the memory read for a load immediate removes 3-4 bytes per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also saves the d-cache hit). About the performance impact of tracepoints (which is comparable to markers), even without immediate values optimizations, tests done by Hideo Aoki on ia64 show no regression. His test case was using hackbench on a kernel where scheduler instrumentation (about 5 events in code scheduler code) was added. Quoting Hideo Aoki about Markers : I evaluated overhead of kernel marker using linux-2.6-sched-fixes git tree, which includes several markers for LTTng, using an ia64 server. While the immediate trace mark feature isn't implemented on ia64, there is no major performance regression. So, I think that we don't have any issues to propose merging marker point patches into Linus's tree from the viewpoint of performance impact. I prepared two kernels to evaluate. The first one was compiled without CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS. I downloaded the original hackbench from the following URL: http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c I ran hackbench 5 times in each condition and calculated the average and difference between the kernels. The parameter of hackbench: every 50 from 50 to 800 The number of CPUs of the server: 2, 4, and 8 Below is the results. As you can see, major performance regression wasn't found in any case. Even if number of processes increases, differences between marker-enabled kernel and marker- disabled kernel doesn't increase. Moreover, if number of CPUs increases, the differences doesn't increase either. Curiously, marker-enabled kernel is better than marker-disabled kernel in more than half cases, although I guess it comes from the difference of memory access pattern. * 2 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 4.811 | 4.872 | +0.061 | +1.27 | 100 | 9.854 | 10.309 | +0.454 | +4.61 | 150 | 15.602 | 15.040 | -0.562 | -3.6 | 200 | 20.489 | 20.380 | -0.109 | -0.53 | 250 | 25.798 | 25.652 | -0.146 | -0.56 | 300 | 31.260 | 30.797 | -0.463 | -1.48 | 350 | 36.121 | 35.770 | -0.351 | -0.97 | 400 | 42.288 | 42.102 | -0.186 | -0.44 | 450 | 47.778 | 47.253 | -0.526 | -1.1 | 500 | 51.953 | 52.278 | +0.325 | +0.63 | 550 | 58.401 | 57.700 | -0.701 | -1.2 | 600 | 63.334 | 63.222 | -0.112 | -0.18 | 650 | 68.816 | 68.511 | -0.306 | -0.44 | 700 | 74.667 | 74.088 | -0.579 | -0.78 | 750 | 78.612 | 79.582 | +0.970 | +1.23 | 800 | 85.431 | 85.263 | -0.168 | -0.2 | -------------------------------------------------------------- * 4 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.586 | 2.584 | -0.003 | -0.1 | 100 | 5.254 | 5.283 | +0.030 | +0.56 | 150 | 8.012 | 8.074 | +0.061 | +0.76 | 200 | 11.172 | 11.000 | -0.172 | -1.54 | 250 | 13.917 | 14.036 | +0.119 | +0.86 | 300 | 16.905 | 16.543 | -0.362 | -2.14 | 350 | 19.901 | 20.036 | +0.135 | +0.68 | 400 | 22.908 | 23.094 | +0.186 | +0.81 | 450 | 26.273 | 26.101 | -0.172 | -0.66 | 500 | 29.554 | 29.092 | -0.461 | -1.56 | 550 | 32.377 | 32.274 | -0.103 | -0.32 | 600 | 35.855 | 35.322 | -0.533 | -1.49 | 650 | 39.192 | 38.388 | -0.804 | -2.05 | 700 | 41.744 | 41.719 | -0.025 | -0.06 | 750 | 45.016 | 44.496 | -0.520 | -1.16 | 800 | 48.212 | 47.603 | -0.609 | -1.26 | -------------------------------------------------------------- * 8 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.094 | 2.072 | -0.022 | -1.07 | 100 | 4.162 | 4.273 | +0.111 | +2.66 | 150 | 6.485 | 6.540 | +0.055 | +0.84 | 200 | 8.556 | 8.478 | -0.078 | -0.91 | 250 | 10.458 | 10.258 | -0.200 | -1.91 | 300 | 12.425 | 12.750 | +0.325 | +2.62 | 350 | 14.807 | 14.839 | +0.032 | +0.22 | 400 | 16.801 | 16.959 | +0.158 | +0.94 | 450 | 19.478 | 19.009 | -0.470 | -2.41 | 500 | 21.296 | 21.504 | +0.208 | +0.98 | 550 | 23.842 | 23.979 | +0.137 | +0.57 | 600 | 26.309 | 26.111 | -0.198 | -0.75 | 650 | 28.705 | 28.446 | -0.259 | -0.9 | 700 | 31.233 | 31.394 | +0.161 | +0.52 | 750 | 34.064 | 33.720 | -0.344 | -1.01 | 800 | 36.320 | 36.114 | -0.206 | -0.57 | -------------------------------------------------------------- Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: 'Peter Zijlstra' <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-18 16:16:16 +00:00
*(__tracepoints_strings)/* Tracepoints: strings */ \
} \
\
.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \
*(.rodata1) \
} \
\
/* PCI quirks */ \
.pci_fixup : AT(ADDR(.pci_fixup) - LOAD_OFFSET) { \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_early, _pci_fixups_early, __start, __end) \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_header, _pci_fixups_header, __start, __end) \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_final, _pci_fixups_final, __start, __end) \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_enable, _pci_fixups_enable, __start, __end) \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_resume, _pci_fixups_resume, __start, __end) \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_suspend, _pci_fixups_suspend, __start, __end) \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_resume_early, _pci_fixups_resume_early, __start, __end) \
BOUNDED_SECTION_PRE_LABEL(.pci_fixup_suspend_late, _pci_fixups_suspend_late, __start, __end) \
} \
\
FW_LOADER_BUILT_IN_DATA \
TRACEDATA \
\
printk: Userspace format indexing support We have a number of systems industry-wide that have a subset of their functionality that works as follows: 1. Receive a message from local kmsg, serial console, or netconsole; 2. Apply a set of rules to classify the message; 3. Do something based on this classification (like scheduling a remediation for the machine), rinse, and repeat. As a couple of examples of places we have this implemented just inside Facebook, although this isn't a Facebook-specific problem, we have this inside our netconsole processing (for alarm classification), and as part of our machine health checking. We use these messages to determine fairly important metrics around production health, and it's important that we get them right. While for some kinds of issues we have counters, tracepoints, or metrics with a stable interface which can reliably indicate the issue, in order to react to production issues quickly we need to work with the interface which most kernel developers naturally use when developing: printk. Most production issues come from unexpected phenomena, and as such usually the code in question doesn't have easily usable tracepoints or other counters available for the specific problem being mitigated. We have a number of lines of monitoring defence against problems in production (host metrics, process metrics, service metrics, etc), and where it's not feasible to reliably monitor at another level, this kind of pragmatic netconsole monitoring is essential. As one would expect, monitoring using printk is rather brittle for a number of reasons -- most notably that the message might disappear entirely in a new version of the kernel, or that the message may change in some way that the regex or other classification methods start to silently fail. One factor that makes this even harder is that, under normal operation, many of these messages are never expected to be hit. For example, there may be a rare hardware bug which one wants to detect if it was to ever happen again, but its recurrence is not likely or anticipated. This precludes using something like checking whether the printk in question was printed somewhere fleetwide recently to determine whether the message in question is still present or not, since we don't anticipate that it should be printed anywhere, but still need to monitor for its future presence in the long-term. This class of issue has happened on a number of occasions, causing unhealthy machines with hardware issues to remain in production for longer than ideal. As a recent example, some monitoring around blk_update_request fell out of date and caused semi-broken machines to remain in production for longer than would be desirable. Searching through the codebase to find the message is also extremely fragile, because many of the messages are further constructed beyond their callsite (eg. btrfs_printk and other module-specific wrappers, each with their own functionality). Even if they aren't, guessing the format and formulation of the underlying message based on the aesthetics of the message emitted is not a recipe for success at scale, and our previous issues with fleetwide machine health checking demonstrate as much. This provides a solution to the issue of silently changed or deleted printks: we record pointers to all printk format strings known at compile time into a new .printk_index section, both in vmlinux and modules. At runtime, this can then be iterated by looking at <debugfs>/printk/index/<module>, which emits the following format, both readable by humans and able to be parsed by machines: $ head -1 vmlinux; shuf -n 5 vmlinux # <level[,flags]> filename:line function "format" <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n" <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n" <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n" <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n" <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" This mitigates the majority of cases where we have a highly-specific printk which we want to match on, as we can now enumerate and check whether the format changed or the printk callsite disappeared entirely in userspace. This allows us to catch changes to printks we monitor earlier and decide what to do about it before it becomes problematic. There is no additional runtime cost for printk callers or printk itself, and the assembly generated is exactly the same. Signed-off-by: Chris Down <chris@chrisdown.name> Cc: Petr Mladek <pmladek@suse.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h} Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
2021-06-15 16:52:53 +00:00
PRINTK_INDEX \
\
/* Kernel symbol table: Normal symbols */ \
__ksymtab : AT(ADDR(__ksymtab) - LOAD_OFFSET) { \
__start___ksymtab = .; \
KEEP(*(SORT(___ksymtab+*))) \
__stop___ksymtab = .; \
} \
\
/* Kernel symbol table: GPL-only symbols */ \
__ksymtab_gpl : AT(ADDR(__ksymtab_gpl) - LOAD_OFFSET) { \
__start___ksymtab_gpl = .; \
KEEP(*(SORT(___ksymtab_gpl+*))) \
__stop___ksymtab_gpl = .; \
} \
\
/* Kernel symbol table: Normal symbols */ \
__kcrctab : AT(ADDR(__kcrctab) - LOAD_OFFSET) { \
__start___kcrctab = .; \
KEEP(*(SORT(___kcrctab+*))) \
__stop___kcrctab = .; \
} \
\
/* Kernel symbol table: GPL-only symbols */ \
__kcrctab_gpl : AT(ADDR(__kcrctab_gpl) - LOAD_OFFSET) { \
__start___kcrctab_gpl = .; \
KEEP(*(SORT(___kcrctab_gpl+*))) \
__stop___kcrctab_gpl = .; \
} \
\
/* Kernel symbol table: strings */ \
__ksymtab_strings : AT(ADDR(__ksymtab_strings) - LOAD_OFFSET) { \
*(__ksymtab_strings) \
} \
\
/* __*init sections */ \
__init_rodata : AT(ADDR(__init_rodata) - LOAD_OFFSET) { \
Introduce new section reference annotations tags: __ref, __refdata, __refconst Today we have the following annotations for functions/data referencing __init/__exit functions / data: __init_refok => for init functions __initdata_refok => for init data __exit_refok => for exit functions There is really no difference between the __init and __exit versions and simplify it and to introduce a shorter annotation the following new annotations are introduced: __ref => for functions (code) that references __*init / __*exit __refdata => for variables __refconst => for const variables Whit this annotation is it more obvious what the annotation is for and there is no longer the arbitary division between __init and __exit code. The mechanishm is the same as before - a special section is created which is made part of the usual sections in the linker script. We will start to see annotations like this: -static struct pci_serial_quirk pci_serial_quirks[] = { +static const struct pci_serial_quirk pci_serial_quirks[] __refconst = { ----------------- -static struct notifier_block __cpuinitdata cpuid_class_cpu_notifier = +static struct notifier_block cpuid_class_cpu_notifier __refdata = ---------------- -static int threshold_cpu_callback(struct notifier_block *nfb, +static int __ref threshold_cpu_callback(struct notifier_block *nfb, [The above is just random samples]. Note: No modifications were needed in modpost to support the new sections due to the newly introduced blacklisting. Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2008-01-28 19:21:15 +00:00
*(.ref.rodata) \
} \
\
/* Built-in module parameters. */ \
__param : AT(ADDR(__param) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(__param, ___param) \
} \
\
/* Built-in module versions. */ \
__modver : AT(ADDR(__modver) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(__modver, ___modver) \
} \
\
KCFI_TRAPS \
\
RO_EXCEPTION_TABLE \
NOTES \
bpf: Support llvm-objcopy for vmlinux BTF Simplify gen_btf logic to make it work with llvm-objcopy. The existing 'file format' and 'architecture' parsing logic is brittle and does not work with llvm-objcopy/llvm-objdump. 'file format' output of llvm-objdump>=11 will match GNU objdump, but 'architecture' (bfdarch) may not. .BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag because it is part of vmlinux image used for introspection. C code can reference the section via linker script defined __start_BTF and __stop_BTF. This fixes a small problem that previous .BTF had the SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data). Additionally, `objcopy -I binary` synthesized symbols _binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not used elsewhere) are replaced with more commonplace __start_BTF and __stop_BTF. Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns "empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?" We use a dd command to change the e_type field in the ELF header from ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting ET_EXEC as an input file is an extremely rare GNU ld feature that lld does not intend to support, because this is error-prone. The output section description .BTF in include/asm-generic/vmlinux.lds.h avoids potential subtle orphan section placement issues and suppresses --orphan-handling=warn warnings. Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux") Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Stanislav Fomichev <sdf@google.com> Tested-by: Andrii Nakryiko <andriin@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://github.com/ClangBuiltLinux/linux/issues/871 Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com
2020-03-18 22:27:46 +00:00
BTF \
\
. = ALIGN((align)); \
__end_rodata = .;
add support for Clang CFI This change adds support for Clang’s forward-edge Control Flow Integrity (CFI) checking. With CONFIG_CFI_CLANG, the compiler injects a runtime check before each indirect function call to ensure the target is a valid function with the correct static type. This restricts possible call targets and makes it more difficult for an attacker to exploit bugs that allow the modification of stored function pointers. For more details, see: https://clang.llvm.org/docs/ControlFlowIntegrity.html Clang requires CONFIG_LTO_CLANG to be enabled with CFI to gain visibility to possible call targets. Kernel modules are supported with Clang’s cross-DSO CFI mode, which allows checking between independently compiled components. With CFI enabled, the compiler injects a __cfi_check() function into the kernel and each module for validating local call targets. For cross-module calls that cannot be validated locally, the compiler calls the global __cfi_slowpath_diag() function, which determines the target module and calls the correct __cfi_check() function. This patch includes a slowpath implementation that uses __module_address() to resolve call targets, and with CONFIG_CFI_CLANG_SHADOW enabled, a shadow map that speeds up module look-ups by ~3x. Clang implements indirect call checking using jump tables and offers two methods of generating them. With canonical jump tables, the compiler renames each address-taken function to <function>.cfi and points the original symbol to a jump table entry, which passes __cfi_check() validation. This isn’t compatible with stand-alone assembly code, which the compiler doesn’t instrument, and would result in indirect calls to assembly code to fail. Therefore, we default to using non-canonical jump tables instead, where the compiler generates a local jump table entry <function>.cfi_jt for each address-taken function, and replaces all references to the function with the address of the jump table entry. Note that because non-canonical jump table addresses are local to each component, they break cross-module function address equality. Specifically, the address of a global function will be different in each module, as it's replaced with the address of a local jump table entry. If this address is passed to a different module, it won’t match the address of the same function taken there. This may break code that relies on comparing addresses passed from other components. CFI checking can be disabled in a function with the __nocfi attribute. Additionally, CFI can be disabled for an entire compilation unit by filtering out CC_FLAGS_CFI. By default, CFI failures result in a kernel panic to stop a potential exploit. CONFIG_CFI_PERMISSIVE enables a permissive mode, where the kernel prints out a rate-limited warning instead, and allows execution to continue. This option is helpful for locating type mismatches, but should only be enabled during development. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-2-samitolvanen@google.com
2021-04-08 18:28:26 +00:00
/*
* Non-instrumentable text section
*/
#define NOINSTR_TEXT \
ALIGN_FUNCTION(); \
__noinstr_text_start = .; \
*(.noinstr.text) \
__cpuidle_text_start = .; \
*(.cpuidle.text) \
__cpuidle_text_end = .; \
__noinstr_text_end = .;
#define TEXT_SPLIT \
__split_text_start = .; \
*(.text.split .text.split.[0-9a-zA-Z_]*) \
__split_text_end = .;
#define TEXT_UNLIKELY \
__unlikely_text_start = .; \
*(.text.unlikely .text.unlikely.*) \
__unlikely_text_end = .;
#define TEXT_HOT \
__hot_text_start = .; \
*(.text.hot .text.hot.*) \
__hot_text_end = .;
/*
* .text section. Map to function alignment to avoid address changes
* during second ld run in second ld pass when generating System.map
*
vmlinux.lds.h: Adjust symbol ordering in text output section When the -ffunction-sections compiler option is enabled, each function is placed in a separate section named .text.function_name rather than putting all functions in a single .text section. However, using -function-sections can cause problems with the linker script. The comments included in include/asm-generic/vmlinux.lds.h note these issues.: “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead code elimination is enabled, so these sections should be converted to use ".." first.” It is unclear whether there is a straightforward method for converting a suffix to "..". This patch modifies the order of subsections within the text output section. Specifically, it changes current order: .text.hot, .text, .text_unlikely, .text.unknown, .text.asan to the new order: .text.asan, .text.unknown, .text_unlikely, .text.hot, .text Here is the rationale behind the new layout: The majority of the code resides in three sections: .text.hot, .text, and .text.unlikely, with .text.unknown containing a negligible amount. .text.asan is only generated in ASAN builds. The primary goal is to group code segments based on their execution frequency (hotness). First, we want to place .text.hot adjacent to .text. Since we cannot put .text.hot after .text (Due to constraints with -ffunction-sections, placing .text.hot after .text is problematic), we need to put .text.hot before .text. Then it comes to .text.unlikely, we cannot put it after .text (same -ffunction-sections issue) . Therefore, we position .text.unlikely before .text.hot. .text.unknown and .tex.asan follow the same logic. This revised ordering effectively reverses the original arrangement (for .text.unlikely, .text.unknown, and .tex.asan), maintaining a similar level of affinity between sections. It also places .text.hot section at the beginning of a page to better utilize the TLB entry. Note that the limitation arises because the linker script employs glob patterns instead of regular expressions for string matching. While there is a method to maintain the current order using complex patterns, this significantly complicates the pattern and increases the likelihood of errors. This patch also changes vmlinux.lds.S for the sparc64 architecture to accommodate specific symbol placement requirements. Co-developed-by: Han Shen <shenhan@google.com> Signed-off-by: Han Shen <shenhan@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Sriraman Tallam <tmsriram@google.com> Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> Tested-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Yabin Cui <yabinc@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-02 17:51:10 +00:00
* TEXT_MAIN here will match symbols with a fixed pattern (for example,
* .text.hot or .text.unlikely) if dead code elimination or
* function-section is enabled. Match these symbols first before
* TEXT_MAIN to ensure they are grouped together.
*
* Also placing .text.hot section at the beginning of a page, this
* would help the TLB performance.
*/
#define TEXT_TEXT \
ALIGN_FUNCTION(); \
vmlinux.lds.h: Adjust symbol ordering in text output section When the -ffunction-sections compiler option is enabled, each function is placed in a separate section named .text.function_name rather than putting all functions in a single .text section. However, using -function-sections can cause problems with the linker script. The comments included in include/asm-generic/vmlinux.lds.h note these issues.: “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead code elimination is enabled, so these sections should be converted to use ".." first.” It is unclear whether there is a straightforward method for converting a suffix to "..". This patch modifies the order of subsections within the text output section. Specifically, it changes current order: .text.hot, .text, .text_unlikely, .text.unknown, .text.asan to the new order: .text.asan, .text.unknown, .text_unlikely, .text.hot, .text Here is the rationale behind the new layout: The majority of the code resides in three sections: .text.hot, .text, and .text.unlikely, with .text.unknown containing a negligible amount. .text.asan is only generated in ASAN builds. The primary goal is to group code segments based on their execution frequency (hotness). First, we want to place .text.hot adjacent to .text. Since we cannot put .text.hot after .text (Due to constraints with -ffunction-sections, placing .text.hot after .text is problematic), we need to put .text.hot before .text. Then it comes to .text.unlikely, we cannot put it after .text (same -ffunction-sections issue) . Therefore, we position .text.unlikely before .text.hot. .text.unknown and .tex.asan follow the same logic. This revised ordering effectively reverses the original arrangement (for .text.unlikely, .text.unknown, and .tex.asan), maintaining a similar level of affinity between sections. It also places .text.hot section at the beginning of a page to better utilize the TLB entry. Note that the limitation arises because the linker script employs glob patterns instead of regular expressions for string matching. While there is a method to maintain the current order using complex patterns, this significantly complicates the pattern and increases the likelihood of errors. This patch also changes vmlinux.lds.S for the sparc64 architecture to accommodate specific symbol placement requirements. Co-developed-by: Han Shen <shenhan@google.com> Signed-off-by: Han Shen <shenhan@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Sriraman Tallam <tmsriram@google.com> Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> Tested-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Yabin Cui <yabinc@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-02 17:51:10 +00:00
*(.text.asan.* .text.tsan.*) \
*(.text.unknown .text.unknown.*) \
TEXT_SPLIT \
TEXT_UNLIKELY \
vmlinux.lds.h: Adjust symbol ordering in text output section When the -ffunction-sections compiler option is enabled, each function is placed in a separate section named .text.function_name rather than putting all functions in a single .text section. However, using -function-sections can cause problems with the linker script. The comments included in include/asm-generic/vmlinux.lds.h note these issues.: “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead code elimination is enabled, so these sections should be converted to use ".." first.” It is unclear whether there is a straightforward method for converting a suffix to "..". This patch modifies the order of subsections within the text output section. Specifically, it changes current order: .text.hot, .text, .text_unlikely, .text.unknown, .text.asan to the new order: .text.asan, .text.unknown, .text_unlikely, .text.hot, .text Here is the rationale behind the new layout: The majority of the code resides in three sections: .text.hot, .text, and .text.unlikely, with .text.unknown containing a negligible amount. .text.asan is only generated in ASAN builds. The primary goal is to group code segments based on their execution frequency (hotness). First, we want to place .text.hot adjacent to .text. Since we cannot put .text.hot after .text (Due to constraints with -ffunction-sections, placing .text.hot after .text is problematic), we need to put .text.hot before .text. Then it comes to .text.unlikely, we cannot put it after .text (same -ffunction-sections issue) . Therefore, we position .text.unlikely before .text.hot. .text.unknown and .tex.asan follow the same logic. This revised ordering effectively reverses the original arrangement (for .text.unlikely, .text.unknown, and .tex.asan), maintaining a similar level of affinity between sections. It also places .text.hot section at the beginning of a page to better utilize the TLB entry. Note that the limitation arises because the linker script employs glob patterns instead of regular expressions for string matching. While there is a method to maintain the current order using complex patterns, this significantly complicates the pattern and increases the likelihood of errors. This patch also changes vmlinux.lds.S for the sparc64 architecture to accommodate specific symbol placement requirements. Co-developed-by: Han Shen <shenhan@google.com> Signed-off-by: Han Shen <shenhan@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Sriraman Tallam <tmsriram@google.com> Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> Tested-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Yabin Cui <yabinc@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-02 17:51:10 +00:00
. = ALIGN(PAGE_SIZE); \
TEXT_HOT \
vmlinux.lds.h: Add PGO and AutoFDO input sections Basically, consider .text.{hot|unlikely|unknown}.* part of .text, too. When compiling with profiling information (collected via PGO instrumentations or AutoFDO sampling), Clang will separate code into .text.hot, .text.unlikely, or .text.unknown sections based on profiling information. After D79600 (clang-11), these sections will have a trailing `.` suffix, ie. .text.hot., .text.unlikely., .text.unknown.. When using -ffunction-sections together with profiling infomation, either explicitly (FGKASLR) or implicitly (LTO), code may be placed in sections following the convention: .text.hot.<foo>, .text.unlikely.<bar>, .text.unknown.<baz> where <foo>, <bar>, and <baz> are functions. (This produces one section per function; we generally try to merge these all back via linker script so that we don't have 50k sections). For the above cases, we need to teach our linker scripts that such sections might exist and that we'd explicitly like them grouped together, otherwise we can wind up with code outside of the _stext/_etext boundaries that might not be mapped properly for some architectures, resulting in boot failures. If the linker script is not told about possible input sections, then where the section is placed as output is a heuristic-laiden mess that's non-portable between linkers (ie. BFD and LLD), and has resulted in many hard to debug bugs. Kees Cook is working on cleaning this up by adding --orphan-handling=warn linker flag used in ARCH=powerpc to additional architectures. In the case of linker scripts, borrowing from the Zen of Python: explicit is better than implicit. Also, ld.bfd's internal linker script considers .text.hot AND .text.hot.* to be part of .text, as well as .text.unlikely and .text.unlikely.*. I didn't see support for .text.unknown.*, and didn't see Clang producing such code in our kernel builds, but I see code in LLVM that can produce such section names if profiling information is missing. That may point to a larger issue with generating or collecting profiles, but I would much rather be safe and explicit than have to debug yet another issue related to orphan section placement. Reported-by: Jian Cai <jiancai@google.com> Suggested-by: Fāng-ruì Sòng <maskray@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Luis Lozano <llozano@google.com> Tested-by: Manoj Gupta <manojgupta@google.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=add44f8d5c5c05e08b11e033127a744d61c26aee Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=1de778ed23ce7492c523d5850c6c6dbb34152655 Link: https://reviews.llvm.org/D79600 Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1084760 Link: https://lore.kernel.org/r/20200821194310.3089815-7-keescook@chromium.org Debugged-by: Luis Lozano <llozano@google.com>
2020-08-21 19:42:47 +00:00
*(TEXT_MAIN .text.fixup) \
NOINSTR_TEXT \
vmlinux.lds.h: Adjust symbol ordering in text output section When the -ffunction-sections compiler option is enabled, each function is placed in a separate section named .text.function_name rather than putting all functions in a single .text section. However, using -function-sections can cause problems with the linker script. The comments included in include/asm-generic/vmlinux.lds.h note these issues.: “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead code elimination is enabled, so these sections should be converted to use ".." first.” It is unclear whether there is a straightforward method for converting a suffix to "..". This patch modifies the order of subsections within the text output section. Specifically, it changes current order: .text.hot, .text, .text_unlikely, .text.unknown, .text.asan to the new order: .text.asan, .text.unknown, .text_unlikely, .text.hot, .text Here is the rationale behind the new layout: The majority of the code resides in three sections: .text.hot, .text, and .text.unlikely, with .text.unknown containing a negligible amount. .text.asan is only generated in ASAN builds. The primary goal is to group code segments based on their execution frequency (hotness). First, we want to place .text.hot adjacent to .text. Since we cannot put .text.hot after .text (Due to constraints with -ffunction-sections, placing .text.hot after .text is problematic), we need to put .text.hot before .text. Then it comes to .text.unlikely, we cannot put it after .text (same -ffunction-sections issue) . Therefore, we position .text.unlikely before .text.hot. .text.unknown and .tex.asan follow the same logic. This revised ordering effectively reverses the original arrangement (for .text.unlikely, .text.unknown, and .tex.asan), maintaining a similar level of affinity between sections. It also places .text.hot section at the beginning of a page to better utilize the TLB entry. Note that the limitation arises because the linker script employs glob patterns instead of regular expressions for string matching. While there is a method to maintain the current order using complex patterns, this significantly complicates the pattern and increases the likelihood of errors. This patch also changes vmlinux.lds.S for the sparc64 architecture to accommodate specific symbol placement requirements. Co-developed-by: Han Shen <shenhan@google.com> Signed-off-by: Han Shen <shenhan@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Sriraman Tallam <tmsriram@google.com> Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> Tested-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Yabin Cui <yabinc@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-02 17:51:10 +00:00
*(.ref.text)
/* sched.text is aling to function alignment to secure we have same
* address even at second ld pass when generating System.map */
#define SCHED_TEXT \
ALIGN_FUNCTION(); \
__sched_text_start = .; \
*(.sched.text) \
__sched_text_end = .;
/* spinlock.text is aling to function alignment to secure we have same
* address even at second ld pass when generating System.map */
#define LOCK_TEXT \
ALIGN_FUNCTION(); \
__lock_text_start = .; \
*(.spinlock.text) \
__lock_text_end = .;
#define KPROBES_TEXT \
ALIGN_FUNCTION(); \
__kprobes_text_start = .; \
*(.kprobes.text) \
__kprobes_text_end = .;
x86: Separate out entry text section Put x86 entry code into a separate link section: .entry.text. Separating the entry text section seems to have performance benefits - caused by more efficient instruction cache usage. Running hackbench with perf stat --repeat showed that the change compresses the icache footprint. The icache load miss rate went down by about 15%: before patch: 19417627 L1-icache-load-misses ( +- 0.147% ) after patch: 16490788 L1-icache-load-misses ( +- 0.180% ) The motivation of the patch was to fix a particular kprobes bug that relates to the entry text section, the performance advantage was discovered accidentally. Whole perf output follows: - results for current tip tree: Performance counter stats for './hackbench/hackbench 10' (500 runs): 19417627 L1-icache-load-misses ( +- 0.147% ) 2676914223 instructions # 0.497 IPC ( +- 0.079% ) 5389516026 cycles ( +- 0.144% ) 0.206267711 seconds time elapsed ( +- 0.138% ) - results for current tip tree with the patch applied: Performance counter stats for './hackbench/hackbench 10' (500 runs): 16490788 L1-icache-load-misses ( +- 0.180% ) 2717734941 instructions # 0.502 IPC ( +- 0.079% ) 5414756975 cycles ( +- 0.148% ) 0.206747566 seconds time elapsed ( +- 0.137% ) Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: masami.hiramatsu.pt@hitachi.com Cc: ananth@in.ibm.com Cc: davem@davemloft.net Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-07 18:10:39 +00:00
#define ENTRY_TEXT \
ALIGN_FUNCTION(); \
__entry_text_start = .; \
x86: Separate out entry text section Put x86 entry code into a separate link section: .entry.text. Separating the entry text section seems to have performance benefits - caused by more efficient instruction cache usage. Running hackbench with perf stat --repeat showed that the change compresses the icache footprint. The icache load miss rate went down by about 15%: before patch: 19417627 L1-icache-load-misses ( +- 0.147% ) after patch: 16490788 L1-icache-load-misses ( +- 0.180% ) The motivation of the patch was to fix a particular kprobes bug that relates to the entry text section, the performance advantage was discovered accidentally. Whole perf output follows: - results for current tip tree: Performance counter stats for './hackbench/hackbench 10' (500 runs): 19417627 L1-icache-load-misses ( +- 0.147% ) 2676914223 instructions # 0.497 IPC ( +- 0.079% ) 5389516026 cycles ( +- 0.144% ) 0.206267711 seconds time elapsed ( +- 0.138% ) - results for current tip tree with the patch applied: Performance counter stats for './hackbench/hackbench 10' (500 runs): 16490788 L1-icache-load-misses ( +- 0.180% ) 2717734941 instructions # 0.502 IPC ( +- 0.079% ) 5414756975 cycles ( +- 0.148% ) 0.206747566 seconds time elapsed ( +- 0.137% ) Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: masami.hiramatsu.pt@hitachi.com Cc: ananth@in.ibm.com Cc: davem@davemloft.net Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-07 18:10:39 +00:00
*(.entry.text) \
__entry_text_end = .;
x86: Separate out entry text section Put x86 entry code into a separate link section: .entry.text. Separating the entry text section seems to have performance benefits - caused by more efficient instruction cache usage. Running hackbench with perf stat --repeat showed that the change compresses the icache footprint. The icache load miss rate went down by about 15%: before patch: 19417627 L1-icache-load-misses ( +- 0.147% ) after patch: 16490788 L1-icache-load-misses ( +- 0.180% ) The motivation of the patch was to fix a particular kprobes bug that relates to the entry text section, the performance advantage was discovered accidentally. Whole perf output follows: - results for current tip tree: Performance counter stats for './hackbench/hackbench 10' (500 runs): 19417627 L1-icache-load-misses ( +- 0.147% ) 2676914223 instructions # 0.497 IPC ( +- 0.079% ) 5389516026 cycles ( +- 0.144% ) 0.206267711 seconds time elapsed ( +- 0.138% ) - results for current tip tree with the patch applied: Performance counter stats for './hackbench/hackbench 10' (500 runs): 16490788 L1-icache-load-misses ( +- 0.180% ) 2717734941 instructions # 0.502 IPC ( +- 0.079% ) 5414756975 cycles ( +- 0.148% ) 0.206747566 seconds time elapsed ( +- 0.137% ) Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: masami.hiramatsu.pt@hitachi.com Cc: ananth@in.ibm.com Cc: davem@davemloft.net Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-07 18:10:39 +00:00
#define IRQENTRY_TEXT \
ALIGN_FUNCTION(); \
__irqentry_text_start = .; \
*(.irqentry.text) \
__irqentry_text_end = .;
#define SOFTIRQENTRY_TEXT \
ALIGN_FUNCTION(); \
__softirqentry_text_start = .; \
*(.softirqentry.text) \
__softirqentry_text_end = .;
#define STATIC_CALL_TEXT \
ALIGN_FUNCTION(); \
__static_call_text_start = .; \
*(.static_call.text) \
__static_call_text_end = .;
/* Section used for early init (in .S files) */
#define HEAD_TEXT KEEP(*(.head.text))
#define HEAD_TEXT_SECTION \
.head.text : AT(ADDR(.head.text) - LOAD_OFFSET) { \
HEAD_TEXT \
}
/*
* Exception table
*/
#define EXCEPTION_TABLE(align) \
. = ALIGN(align); \
__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(__ex_table, ___ex_table) \
}
bpf: Support llvm-objcopy for vmlinux BTF Simplify gen_btf logic to make it work with llvm-objcopy. The existing 'file format' and 'architecture' parsing logic is brittle and does not work with llvm-objcopy/llvm-objdump. 'file format' output of llvm-objdump>=11 will match GNU objdump, but 'architecture' (bfdarch) may not. .BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag because it is part of vmlinux image used for introspection. C code can reference the section via linker script defined __start_BTF and __stop_BTF. This fixes a small problem that previous .BTF had the SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data). Additionally, `objcopy -I binary` synthesized symbols _binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not used elsewhere) are replaced with more commonplace __start_BTF and __stop_BTF. Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns "empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?" We use a dd command to change the e_type field in the ELF header from ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting ET_EXEC as an input file is an extremely rare GNU ld feature that lld does not intend to support, because this is error-prone. The output section description .BTF in include/asm-generic/vmlinux.lds.h avoids potential subtle orphan section placement issues and suppresses --orphan-handling=warn warnings. Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux") Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Stanislav Fomichev <sdf@google.com> Tested-by: Andrii Nakryiko <andriin@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://github.com/ClangBuiltLinux/linux/issues/871 Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com
2020-03-18 22:27:46 +00:00
/*
* .BTF
*/
#ifdef CONFIG_DEBUG_INFO_BTF
#define BTF \
.BTF : AT(ADDR(.BTF) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(.BTF, _BTF) \
} \
. = ALIGN(4); \
.BTF_ids : AT(ADDR(.BTF_ids) - LOAD_OFFSET) { \
*(.BTF_ids) \
bpf: Support llvm-objcopy for vmlinux BTF Simplify gen_btf logic to make it work with llvm-objcopy. The existing 'file format' and 'architecture' parsing logic is brittle and does not work with llvm-objcopy/llvm-objdump. 'file format' output of llvm-objdump>=11 will match GNU objdump, but 'architecture' (bfdarch) may not. .BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag because it is part of vmlinux image used for introspection. C code can reference the section via linker script defined __start_BTF and __stop_BTF. This fixes a small problem that previous .BTF had the SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data). Additionally, `objcopy -I binary` synthesized symbols _binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not used elsewhere) are replaced with more commonplace __start_BTF and __stop_BTF. Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns "empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?" We use a dd command to change the e_type field in the ELF header from ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting ET_EXEC as an input file is an extremely rare GNU ld feature that lld does not intend to support, because this is error-prone. The output section description .BTF in include/asm-generic/vmlinux.lds.h avoids potential subtle orphan section placement issues and suppresses --orphan-handling=warn warnings. Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux") Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Stanislav Fomichev <sdf@google.com> Tested-by: Andrii Nakryiko <andriin@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://github.com/ClangBuiltLinux/linux/issues/871 Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com
2020-03-18 22:27:46 +00:00
}
#else
#define BTF
#endif
/*
* Init task
*/
#define INIT_TASK_DATA_SECTION(align) \
. = ALIGN(align); \
.data..init_task : AT(ADDR(.data..init_task) - LOAD_OFFSET) { \
INIT_TASK_DATA(align) \
}
#ifdef CONFIG_CONSTRUCTORS
gcov: fix __ctors_start alignment The ctors section for each object file is eight byte aligned (on 64 bit). However the __ctors_start symbol starts at an arbitrary address dependent on the size of the previous sections. Therefore the linker may add some zeroes after __ctors_start to make sure the ctors contents are properly aligned. However the extra zeroes at the beginning aren't expected by the code. When walking the functions pointers contained in there and extra zeroes are added this may result in random jumps. So make sure that the __ctors_start symbol is always aligned as well. Fixes this crash on an allyesconfig on s390: [ 0.582482] Kernel BUG at 0000000000000012 [verbose debug info unavailable] [ 0.582489] illegal operation: 0001 [#1] SMP DEBUG_PAGEALLOC [ 0.582496] Modules linked in: [ 0.582501] CPU: 0 Tainted: G W 2.6.31-rc1-dirty #273 [ 0.582506] Process swapper (pid: 1, task: 000000003f218000, ksp: 000000003f2238e8) [ 0.582510] Krnl PSW : 0704200180000000 0000000000000012 (0x12) [ 0.582518] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3 [ 0.582524] Krnl GPRS: 0000000000036727 0000000000000010 0000000000000001 0000000000000001 [ 0.582529] 00000000001dfefa 0000000000000000 0000000000000000 0000000000000040 [ 0.582534] 0000000001fff0f0 0000000001790628 0000000002296048 0000000002296048 [ 0.582540] 00000000020c438e 0000000001786000 0000000002014a66 000000003f223e60 [ 0.582553] Krnl Code:>0000000000000012: 0000 unknown [ 0.582559] 0000000000000014: 0000 unknown [ 0.582564] 0000000000000016: 0000 unknown [ 0.582570] 0000000000000018: 0000 unknown [ 0.582575] 000000000000001a: 0000 unknown [ 0.582580] 000000000000001c: 0000 unknown [ 0.582585] 000000000000001e: 0000 unknown [ 0.582591] 0000000000000020: 0000 unknown [ 0.582596] Call Trace: [ 0.582599] ([<0000000002014a46>] kernel_init+0x622/0x7a0) [ 0.582607] [<0000000000113e22>] kernel_thread_starter+0x6/0xc [ 0.582615] [<0000000000113e1c>] kernel_thread_starter+0x0/0xc [ 0.582621] INFO: lockdep is turned off. [ 0.582624] Last Breaking-Event-Address: [ 0.582627] [<0000000002014a64>] kernel_init+0x640/0x7a0 Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:41:13 +00:00
#define KERNEL_CTORS() . = ALIGN(8); \
__ctors_start = .; \
KEEP(*(SORT(.ctors.*))) \
KEEP(*(.ctors)) \
KEEP(*(SORT(.init_array.*))) \
KEEP(*(.init_array)) \
__ctors_end = .;
#else
#define KERNEL_CTORS()
#endif
/* init and exit section handling */
#define INIT_DATA \
KEEP(*(SORT(___kentry+*))) \
*(.init.data .init.data.*) \
KERNEL_CTORS() \
MCOUNT_REC() \
*(.init.rodata .init.rodata.*) \
tracing: Replace trace_event struct array with pointer array Currently the trace_event structures are placed in the _ftrace_events section, and at link time, the linker makes one large array of all the trace_event structures. On boot up, this array is read (much like the initcall sections) and the events are processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the _ftrace_event section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The _ftrace_event section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-01-27 14:15:30 +00:00
FTRACE_EVENTS() \
tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-02-02 22:06:09 +00:00
TRACE_SYSCALLS() \
kprobes: Introduce NOKPROBE_SYMBOL() macro to maintain kprobes blacklist Introduce NOKPROBE_SYMBOL() macro which builds a kprobes blacklist at kernel build time. The usage of this macro is similar to EXPORT_SYMBOL(), placed after the function definition: NOKPROBE_SYMBOL(function); Since this macro will inhibit inlining of static/inline functions, this patch also introduces a nokprobe_inline macro for static/inline functions. In this case, we must use NOKPROBE_SYMBOL() for the inline function caller. When CONFIG_KPROBES=y, the macro stores the given function address in the "_kprobe_blacklist" section. Since the data structures are not fully initialized by the macro (because there is no "size" information), those are re-initialized at boot time by using kallsyms. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081705.26341.96719.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Alok Kataria <akataria@vmware.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christopher Li <sparse@chrisli.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: David S. Miller <davem@davemloft.net> Cc: Jan-Simon Möller <dl9pf@gmx.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-sparse@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17 08:17:05 +00:00
KPROBE_BLACKLIST() \
ERROR_INJECT_WHITELIST() \
CLK_OF_TABLES() \
RESERVEDMEM_OF_TABLES() \
TIMER_OF_TABLES() \
CPU_METHOD_OF_TABLES() \
CPUIDLE_METHOD_OF_TABLES() \
irqchip: add basic infrastructure With the recent creation of the drivers/irqchip/ directory, it is desirable to move irq controller drivers here. At the moment, the only driver here is irq-bcm2835, the driver for the irq controller found in the ARM BCM2835 SoC, present in Rasberry Pi systems. This irq controller driver was exporting its initialization function and its irq handling function through a header file in <linux/irqchip/bcm2835.h>. When proposing to also move another irq controller driver in drivers/irqchip, Rob Herring raised the very valid point that moving things to drivers/irqchip was good in order to remove more stuff from arch/arm, but if it means adding gazillions of headers files in include/linux/irqchip/, it would not be very nice. So, upon the suggestion of Rob Herring and Arnd Bergmann, this commit introduces a small infrastructure that defines a central irqchip_init() function in drivers/irqchip/irqchip.c, which is meant to be called as the ->init_irq() callback of ARM platforms. This function calls of_irq_init() with an array of match strings and init functions generated from a special linker section. Note that the irq controller driver initialization function is responsible for setting the global handle_arch_irq() variable, so that ARM platforms no longer have to define the ->handle_irq field in their DT_MACHINE structure. A global header, <linux/irqchip.h> is also added to expose the single irqchip_init() function to the reset of the kernel. A further commit moves the BCM2835 irq controller driver to this new small infrastructure, therefore removing the include/linux/irqchip/ directory. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Reviewed-by: Stephen Warren <swarren@wwwdotorg.org> Reviewed-by: Rob Herring <rob.herring@calxeda.com> Acked-by: Arnd Bergmann <arnd@arndb.de> [rob.herring: reword commit message to reflect use of linker sections.] Signed-off-by: Rob Herring <rob.herring@calxeda.com>
2012-11-20 22:00:52 +00:00
KERNEL_DTB() \
IRQCHIP_OF_MATCH_TABLE() \
ACPI_PROBE_TABLE(irqchip) \
ACPI_PROBE_TABLE(timer) \
THERMAL_TABLE(governor) \
EARLYCON_TABLE() \
LSM_TABLE() \
EARLY_LSM_TABLE() \
KUNIT_INIT_TABLE()
#define INIT_TEXT \
*(.init.text .init.text.*) \
*(.text.startup)
#define EXIT_DATA \
*(.exit.data .exit.data.*) \
*(.fini_array .fini_array.*) \
*(.dtors .dtors.*) \
#define EXIT_TEXT \
*(.exit.text) \
vmlinux.lds: account for destructor sections If CONFIG_KASAN is enabled and gcc is configured with --disable-initfini-array and/or gold linker is used, gcc emits .ctors/.dtors and .text.startup/.text.exit sections instead of .init_array/.fini_array. .dtors section is not explicitly accounted in the linker script and messes vvar/percpu layout. We want: ffffffff822bfd80 D _edata ffffffff822c0000 D __vvar_beginning_hack ffffffff822c0000 A __vvar_page ffffffff822c0080 0000000000000098 D vsyscall_gtod_data ffffffff822c1000 A __init_begin ffffffff822c1000 D init_per_cpu__irq_stack_union ffffffff822c1000 A __per_cpu_load ffffffff822d3000 D init_per_cpu__gdt_page We got: ffffffff8279a600 D _edata ffffffff8279b000 A __vvar_page ffffffff8279c000 A __init_begin ffffffff8279c000 D init_per_cpu__irq_stack_union ffffffff8279c000 A __per_cpu_load ffffffff8279e000 D __vvar_beginning_hack ffffffff8279e080 0000000000000098 D vsyscall_gtod_data ffffffff827ae000 D init_per_cpu__gdt_page This happens because __vvar_page and .vvar get different addresses in arch/x86/kernel/vmlinux.lds.S: . = ALIGN(PAGE_SIZE); __vvar_page = .; .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) { /* work around gold bug 13023 */ __vvar_beginning_hack = .; Discard .dtors/.fini_array/.text.exit, since we don't call dtors. Merge .text.startup into init text. Link: http://lkml.kernel.org/r/1467386363-120030-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-14 19:07:29 +00:00
*(.text.exit) \
#define EXIT_CALL \
*(.exitcall.exit)
/*
* bss (Block Started by Symbol) - uninitialized data
* zeroed during startup
*/
vmlinux.lds.h: restructure BSS linker script macros The BSS section macros in vmlinux.lds.h currently place the .sbss input section outside the bounds of [__bss_start, __bss_end]. On all architectures except for microblaze that handle both .sbss and __bss_start/__bss_end, this is wrong: the .sbss input section is within the range [__bss_start, __bss_end]. Relatedly, the example code at the top of the file actually has __bss_start/__bss_end defined twice; I believe the right fix here is to define them in the BSS_SECTION macro but not in the BSS macro. Another problem with the current macros is that several architectures have an ALIGN(4) or some other small number just before __bss_stop in their linker scripts. The BSS_SECTION macro currently hardcodes this to 4; while it should really be an argument. It also ignores its sbss_align argument; fix that. mn10300 is the only user at present of any of the macros touched by this patch. It looks like mn10300 actually was incorrectly converted to use the new BSS() macro (the alignment of 4 prior to conversion was a __bss_stop alignment, but the argument to the BSS macro is a start alignment). So fix this as well. I'd like acks from Sam and David on this one. Also CCing Paul, since he has a patch from me which will need to be updated to use BSS_SECTION(0, PAGE_SIZE, 4) once this gets merged. Signed-off-by: Tim Abbott <tabbott@ksplice.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2009-07-12 22:23:33 +00:00
#define SBSS(sbss_align) \
. = ALIGN(sbss_align); \
.sbss : AT(ADDR(.sbss) - LOAD_OFFSET) { \
*(.dynsbss) \
*(SBSS_MAIN) \
*(.scommon) \
}
/*
* Allow archectures to redefine BSS_FIRST_SECTIONS to add extra
* sections to the front of bss.
*/
#ifndef BSS_FIRST_SECTIONS
#define BSS_FIRST_SECTIONS
#endif
#define BSS(bss_align) \
. = ALIGN(bss_align); \
.bss : AT(ADDR(.bss) - LOAD_OFFSET) { \
BSS_FIRST_SECTIONS \
. = ALIGN(PAGE_SIZE); \
*(.bss..page_aligned) \
. = ALIGN(PAGE_SIZE); \
*(.dynbss) \
*(BSS_MAIN) \
*(COMMON) \
}
/*
* DWARF debug sections.
* Symbols in the DWARF debugging sections are relative to
* the beginning of the section so we begin them at 0.
*/
#define DWARF_DEBUG \
/* DWARF 1 */ \
.debug 0 : { *(.debug) } \
.line 0 : { *(.line) } \
/* GNU DWARF 1 extensions */ \
.debug_srcinfo 0 : { *(.debug_srcinfo) } \
.debug_sfnames 0 : { *(.debug_sfnames) } \
/* DWARF 1.1 and DWARF 2 */ \
.debug_aranges 0 : { *(.debug_aranges) } \
.debug_pubnames 0 : { *(.debug_pubnames) } \
/* DWARF 2 */ \
.debug_info 0 : { *(.debug_info \
.gnu.linkonce.wi.*) } \
.debug_abbrev 0 : { *(.debug_abbrev) } \
.debug_line 0 : { *(.debug_line) } \
.debug_frame 0 : { *(.debug_frame) } \
.debug_str 0 : { *(.debug_str) } \
.debug_loc 0 : { *(.debug_loc) } \
.debug_macinfo 0 : { *(.debug_macinfo) } \
.debug_pubtypes 0 : { *(.debug_pubtypes) } \
/* DWARF 3 */ \
.debug_ranges 0 : { *(.debug_ranges) } \
/* SGI/MIPS DWARF 2 extensions */ \
.debug_weaknames 0 : { *(.debug_weaknames) } \
.debug_funcnames 0 : { *(.debug_funcnames) } \
.debug_typenames 0 : { *(.debug_typenames) } \
.debug_varnames 0 : { *(.debug_varnames) } \
/* GNU DWARF 2 extensions */ \
.debug_gnu_pubnames 0 : { *(.debug_gnu_pubnames) } \
.debug_gnu_pubtypes 0 : { *(.debug_gnu_pubtypes) } \
/* DWARF 4 */ \
.debug_types 0 : { *(.debug_types) } \
/* DWARF 5 */ \
.debug_addr 0 : { *(.debug_addr) } \
.debug_line_str 0 : { *(.debug_line_str) } \
.debug_loclists 0 : { *(.debug_loclists) } \
.debug_macro 0 : { *(.debug_macro) } \
.debug_names 0 : { *(.debug_names) } \
.debug_rnglists 0 : { *(.debug_rnglists) } \
.debug_str_offsets 0 : { *(.debug_str_offsets) }
/* Stabs debugging sections. */
#define STABS_DEBUG \
.stab 0 : { *(.stab) } \
.stabstr 0 : { *(.stabstr) } \
.stab.excl 0 : { *(.stab.excl) } \
.stab.exclstr 0 : { *(.stab.exclstr) } \
.stab.index 0 : { *(.stab.index) } \
.stab.indexstr 0 : { *(.stab.indexstr) }
/* Required sections not related to debugging. */
#define ELF_DETAILS \
.comment 0 : { *(.comment) } \
.symtab 0 : { *(.symtab) } \
.strtab 0 : { *(.strtab) } \
.shstrtab 0 : { *(.shstrtab) }
#ifdef CONFIG_GENERIC_BUG
[PATCH] Generic BUG implementation This patch adds common handling for kernel BUGs, for use by architectures as they wish. The code is derived from arch/powerpc. The advantages of having common BUG handling are: - consistent BUG reporting across architectures - shared implementation of out-of-line file/line data - implement CONFIG_DEBUG_BUGVERBOSE consistently This means that in inline impact of BUG is just the illegal instruction itself, which is an improvement for i386 and x86-64. A BUG is represented in the instruction stream as an illegal instruction, which has file/line information associated with it. This extra information is stored in the __bug_table section in the ELF file. When the kernel gets an illegal instruction, it first confirms it might possibly be from a BUG (ie, in kernel mode, the right illegal instruction). It then calls report_bug(). This searches __bug_table for a matching instruction pointer, and if found, prints the corresponding file/line information. If report_bug() determines that it wasn't a BUG which caused the trap, it returns BUG_TRAP_TYPE_NONE. Some architectures (powerpc) implement WARN using the same mechanism; if the illegal instruction was the result of a WARN, then report_bug(Q) returns CONFIG_DEBUG_BUGVERBOSE; otherwise it returns BUG_TRAP_TYPE_BUG. lib/bug.c keeps a list of loaded modules which can be searched for __bug_table entries. The architecture must call module_bug_finalize()/module_bug_cleanup() from its corresponding module_finalize/cleanup functions. Unsetting CONFIG_DEBUG_BUGVERBOSE will reduce the kernel size by some amount. At the very least, filename and line information will not be recorded for each but, but architectures may decide to store no extra information per BUG at all. Unfortunately, gcc doesn't have a general way to mark an asm() as noreturn, so architectures will generally have to include an infinite loop (or similar) in the BUG code, so that gcc knows execution won't continue beyond that point. gcc does have a __builtin_trap() operator which may be useful to achieve the same effect, unfortunately it cannot be used to actually implement the BUG itself, because there's no way to get the instruction's address for use in generating the __bug_table entry. [randy.dunlap@oracle.com: Handle BUG=n, GENERIC_BUG=n to prevent build errors] [bunk@stusta.de: include/linux/bug.h must always #include <linux/module.h] Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Andi Kleen <ak@muc.de> Cc: Hugh Dickens <hugh@veritas.com> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 10:36:19 +00:00
#define BUG_TABLE \
. = ALIGN(8); \
__bug_table : AT(ADDR(__bug_table) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(__bug_table, ___bug_table) \
[PATCH] Generic BUG implementation This patch adds common handling for kernel BUGs, for use by architectures as they wish. The code is derived from arch/powerpc. The advantages of having common BUG handling are: - consistent BUG reporting across architectures - shared implementation of out-of-line file/line data - implement CONFIG_DEBUG_BUGVERBOSE consistently This means that in inline impact of BUG is just the illegal instruction itself, which is an improvement for i386 and x86-64. A BUG is represented in the instruction stream as an illegal instruction, which has file/line information associated with it. This extra information is stored in the __bug_table section in the ELF file. When the kernel gets an illegal instruction, it first confirms it might possibly be from a BUG (ie, in kernel mode, the right illegal instruction). It then calls report_bug(). This searches __bug_table for a matching instruction pointer, and if found, prints the corresponding file/line information. If report_bug() determines that it wasn't a BUG which caused the trap, it returns BUG_TRAP_TYPE_NONE. Some architectures (powerpc) implement WARN using the same mechanism; if the illegal instruction was the result of a WARN, then report_bug(Q) returns CONFIG_DEBUG_BUGVERBOSE; otherwise it returns BUG_TRAP_TYPE_BUG. lib/bug.c keeps a list of loaded modules which can be searched for __bug_table entries. The architecture must call module_bug_finalize()/module_bug_cleanup() from its corresponding module_finalize/cleanup functions. Unsetting CONFIG_DEBUG_BUGVERBOSE will reduce the kernel size by some amount. At the very least, filename and line information will not be recorded for each but, but architectures may decide to store no extra information per BUG at all. Unfortunately, gcc doesn't have a general way to mark an asm() as noreturn, so architectures will generally have to include an infinite loop (or similar) in the BUG code, so that gcc knows execution won't continue beyond that point. gcc does have a __builtin_trap() operator which may be useful to achieve the same effect, unfortunately it cannot be used to actually implement the BUG itself, because there's no way to get the instruction's address for use in generating the __bug_table entry. [randy.dunlap@oracle.com: Handle BUG=n, GENERIC_BUG=n to prevent build errors] [bunk@stusta.de: include/linux/bug.h must always #include <linux/module.h] Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Andi Kleen <ak@muc.de> Cc: Hugh Dickens <hugh@veritas.com> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 10:36:19 +00:00
}
#else
#define BUG_TABLE
#endif
[PATCH] Generic BUG implementation This patch adds common handling for kernel BUGs, for use by architectures as they wish. The code is derived from arch/powerpc. The advantages of having common BUG handling are: - consistent BUG reporting across architectures - shared implementation of out-of-line file/line data - implement CONFIG_DEBUG_BUGVERBOSE consistently This means that in inline impact of BUG is just the illegal instruction itself, which is an improvement for i386 and x86-64. A BUG is represented in the instruction stream as an illegal instruction, which has file/line information associated with it. This extra information is stored in the __bug_table section in the ELF file. When the kernel gets an illegal instruction, it first confirms it might possibly be from a BUG (ie, in kernel mode, the right illegal instruction). It then calls report_bug(). This searches __bug_table for a matching instruction pointer, and if found, prints the corresponding file/line information. If report_bug() determines that it wasn't a BUG which caused the trap, it returns BUG_TRAP_TYPE_NONE. Some architectures (powerpc) implement WARN using the same mechanism; if the illegal instruction was the result of a WARN, then report_bug(Q) returns CONFIG_DEBUG_BUGVERBOSE; otherwise it returns BUG_TRAP_TYPE_BUG. lib/bug.c keeps a list of loaded modules which can be searched for __bug_table entries. The architecture must call module_bug_finalize()/module_bug_cleanup() from its corresponding module_finalize/cleanup functions. Unsetting CONFIG_DEBUG_BUGVERBOSE will reduce the kernel size by some amount. At the very least, filename and line information will not be recorded for each but, but architectures may decide to store no extra information per BUG at all. Unfortunately, gcc doesn't have a general way to mark an asm() as noreturn, so architectures will generally have to include an infinite loop (or similar) in the BUG code, so that gcc knows execution won't continue beyond that point. gcc does have a __builtin_trap() operator which may be useful to achieve the same effect, unfortunately it cannot be used to actually implement the BUG itself, because there's no way to get the instruction's address for use in generating the __bug_table entry. [randy.dunlap@oracle.com: Handle BUG=n, GENERIC_BUG=n to prevent build errors] [bunk@stusta.de: include/linux/bug.h must always #include <linux/module.h] Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Andi Kleen <ak@muc.de> Cc: Hugh Dickens <hugh@veritas.com> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 10:36:19 +00:00
#ifdef CONFIG_UNWINDER_ORC
#define ORC_UNWIND_TABLE \
x86/unwind/orc: Add ELF section with ORC version identifier Commits ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata") and fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two") changed the ORC format. Although ORC is internal to the kernel, it's the only way for external tools to get reliable kernel stack traces on x86-64. In particular, the drgn debugger [1] uses ORC for stack unwinding, and these format changes broke it [2]. As the drgn maintainer, I don't care how often or how much the kernel changes the ORC format as long as I have a way to detect the change. It suffices to store a version identifier in the vmlinux and kernel module ELF files (to use when parsing ORC sections from ELF), and in kernel memory (to use when parsing ORC from a core dump+symbol table). Rather than hard-coding a version number that needs to be manually bumped, Peterz suggested hashing the definitions from orc_types.h. If there is a format change that isn't caught by this, the hashing script can be updated. This patch adds an .orc_header allocated ELF section containing the 20-byte hash to vmlinux and kernel modules, along with the corresponding __start_orc_header and __stop_orc_header symbols in vmlinux. 1: https://github.com/osandov/drgn 2: https://github.com/osandov/drgn/issues/303 Fixes: ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata") Fixes: fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lkml.kernel.org/r/aef9c8dc43915b886a8c48509a12ec1b006ca1ca.1686690801.git.osandov@osandov.com
2023-06-13 21:14:56 +00:00
.orc_header : AT(ADDR(.orc_header) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(.orc_header, _orc_header) \
} \
. = ALIGN(4); \
.orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(.orc_unwind_ip, _orc_unwind_ip) \
} \
. = ALIGN(2); \
.orc_unwind : AT(ADDR(.orc_unwind) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(.orc_unwind, _orc_unwind) \
} \
text_size = _etext - _stext; \
. = ALIGN(4); \
.orc_lookup : AT(ADDR(.orc_lookup) - LOAD_OFFSET) { \
orc_lookup = .; \
. += (((text_size + LOOKUP_BLOCK_SIZE - 1) / \
LOOKUP_BLOCK_SIZE) + 1) * 4; \
orc_lookup_end = .; \
}
#else
#define ORC_UNWIND_TABLE
#endif
/* Built-in firmware blobs */
#ifdef CONFIG_FW_LOADER
#define FW_LOADER_BUILT_IN_DATA \
.builtin_fw : AT(ADDR(.builtin_fw) - LOAD_OFFSET) ALIGN(8) { \
BOUNDED_SECTION_PRE_LABEL(.builtin_fw, _builtin_fw, __start, __end) \
}
#else
#define FW_LOADER_BUILT_IN_DATA
#endif
#ifdef CONFIG_PM_TRACE
#define TRACEDATA \
. = ALIGN(4); \
.tracedata : AT(ADDR(.tracedata) - LOAD_OFFSET) { \
BOUNDED_SECTION_POST_LABEL(.tracedata, __tracedata, _start, _end) \
}
#else
#define TRACEDATA
#endif
printk: Userspace format indexing support We have a number of systems industry-wide that have a subset of their functionality that works as follows: 1. Receive a message from local kmsg, serial console, or netconsole; 2. Apply a set of rules to classify the message; 3. Do something based on this classification (like scheduling a remediation for the machine), rinse, and repeat. As a couple of examples of places we have this implemented just inside Facebook, although this isn't a Facebook-specific problem, we have this inside our netconsole processing (for alarm classification), and as part of our machine health checking. We use these messages to determine fairly important metrics around production health, and it's important that we get them right. While for some kinds of issues we have counters, tracepoints, or metrics with a stable interface which can reliably indicate the issue, in order to react to production issues quickly we need to work with the interface which most kernel developers naturally use when developing: printk. Most production issues come from unexpected phenomena, and as such usually the code in question doesn't have easily usable tracepoints or other counters available for the specific problem being mitigated. We have a number of lines of monitoring defence against problems in production (host metrics, process metrics, service metrics, etc), and where it's not feasible to reliably monitor at another level, this kind of pragmatic netconsole monitoring is essential. As one would expect, monitoring using printk is rather brittle for a number of reasons -- most notably that the message might disappear entirely in a new version of the kernel, or that the message may change in some way that the regex or other classification methods start to silently fail. One factor that makes this even harder is that, under normal operation, many of these messages are never expected to be hit. For example, there may be a rare hardware bug which one wants to detect if it was to ever happen again, but its recurrence is not likely or anticipated. This precludes using something like checking whether the printk in question was printed somewhere fleetwide recently to determine whether the message in question is still present or not, since we don't anticipate that it should be printed anywhere, but still need to monitor for its future presence in the long-term. This class of issue has happened on a number of occasions, causing unhealthy machines with hardware issues to remain in production for longer than ideal. As a recent example, some monitoring around blk_update_request fell out of date and caused semi-broken machines to remain in production for longer than would be desirable. Searching through the codebase to find the message is also extremely fragile, because many of the messages are further constructed beyond their callsite (eg. btrfs_printk and other module-specific wrappers, each with their own functionality). Even if they aren't, guessing the format and formulation of the underlying message based on the aesthetics of the message emitted is not a recipe for success at scale, and our previous issues with fleetwide machine health checking demonstrate as much. This provides a solution to the issue of silently changed or deleted printks: we record pointers to all printk format strings known at compile time into a new .printk_index section, both in vmlinux and modules. At runtime, this can then be iterated by looking at <debugfs>/printk/index/<module>, which emits the following format, both readable by humans and able to be parsed by machines: $ head -1 vmlinux; shuf -n 5 vmlinux # <level[,flags]> filename:line function "format" <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n" <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n" <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n" <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n" <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" This mitigates the majority of cases where we have a highly-specific printk which we want to match on, as we can now enumerate and check whether the format changed or the printk callsite disappeared entirely in userspace. This allows us to catch changes to printks we monitor earlier and decide what to do about it before it becomes problematic. There is no additional runtime cost for printk callers or printk itself, and the assembly generated is exactly the same. Signed-off-by: Chris Down <chris@chrisdown.name> Cc: Petr Mladek <pmladek@suse.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h} Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
2021-06-15 16:52:53 +00:00
#ifdef CONFIG_PRINTK_INDEX
#define PRINTK_INDEX \
.printk_index : AT(ADDR(.printk_index) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(.printk_index, _printk_index) \
printk: Userspace format indexing support We have a number of systems industry-wide that have a subset of their functionality that works as follows: 1. Receive a message from local kmsg, serial console, or netconsole; 2. Apply a set of rules to classify the message; 3. Do something based on this classification (like scheduling a remediation for the machine), rinse, and repeat. As a couple of examples of places we have this implemented just inside Facebook, although this isn't a Facebook-specific problem, we have this inside our netconsole processing (for alarm classification), and as part of our machine health checking. We use these messages to determine fairly important metrics around production health, and it's important that we get them right. While for some kinds of issues we have counters, tracepoints, or metrics with a stable interface which can reliably indicate the issue, in order to react to production issues quickly we need to work with the interface which most kernel developers naturally use when developing: printk. Most production issues come from unexpected phenomena, and as such usually the code in question doesn't have easily usable tracepoints or other counters available for the specific problem being mitigated. We have a number of lines of monitoring defence against problems in production (host metrics, process metrics, service metrics, etc), and where it's not feasible to reliably monitor at another level, this kind of pragmatic netconsole monitoring is essential. As one would expect, monitoring using printk is rather brittle for a number of reasons -- most notably that the message might disappear entirely in a new version of the kernel, or that the message may change in some way that the regex or other classification methods start to silently fail. One factor that makes this even harder is that, under normal operation, many of these messages are never expected to be hit. For example, there may be a rare hardware bug which one wants to detect if it was to ever happen again, but its recurrence is not likely or anticipated. This precludes using something like checking whether the printk in question was printed somewhere fleetwide recently to determine whether the message in question is still present or not, since we don't anticipate that it should be printed anywhere, but still need to monitor for its future presence in the long-term. This class of issue has happened on a number of occasions, causing unhealthy machines with hardware issues to remain in production for longer than ideal. As a recent example, some monitoring around blk_update_request fell out of date and caused semi-broken machines to remain in production for longer than would be desirable. Searching through the codebase to find the message is also extremely fragile, because many of the messages are further constructed beyond their callsite (eg. btrfs_printk and other module-specific wrappers, each with their own functionality). Even if they aren't, guessing the format and formulation of the underlying message based on the aesthetics of the message emitted is not a recipe for success at scale, and our previous issues with fleetwide machine health checking demonstrate as much. This provides a solution to the issue of silently changed or deleted printks: we record pointers to all printk format strings known at compile time into a new .printk_index section, both in vmlinux and modules. At runtime, this can then be iterated by looking at <debugfs>/printk/index/<module>, which emits the following format, both readable by humans and able to be parsed by machines: $ head -1 vmlinux; shuf -n 5 vmlinux # <level[,flags]> filename:line function "format" <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n" <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n" <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n" <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n" <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n" This mitigates the majority of cases where we have a highly-specific printk which we want to match on, as we can now enumerate and check whether the format changed or the printk callsite disappeared entirely in userspace. This allows us to catch changes to printks we monitor earlier and decide what to do about it before it becomes problematic. There is no additional runtime cost for printk callers or printk itself, and the assembly generated is exactly the same. Signed-off-by: Chris Down <chris@chrisdown.name> Cc: Petr Mladek <pmladek@suse.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h} Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
2021-06-15 16:52:53 +00:00
}
#else
#define PRINTK_INDEX
#endif
/*
* Discard .note.GNU-stack, which is emitted as PROGBITS by the compiler.
* Otherwise, the type of .notes section would become PROGBITS instead of NOTES.
vmlinux.lds.h: Discard .note.gnu.property section When tooling reads ELF notes, it assumes each note entry is aligned to the value listed in the .note section header's sh_addralign field. The kernel-created ELF notes in the .note.Linux and .note.Xen sections are aligned to 4 bytes. This causes the toolchain to set those sections' sh_addralign values to 4. On the other hand, the GCC-created .note.gnu.property section has an sh_addralign value of 8 for some reason, despite being based on struct Elf32_Nhdr which only needs 4-byte alignment. When the mismatched input sections get linked together into the vmlinux .notes output section, the higher alignment "wins", resulting in an sh_addralign of 8, which confuses tooling. For example: $ readelf -n .tmp_vmlinux.btf ... readelf: .tmp_vmlinux.btf: Warning: note with invalid namesz and/or descsz found at offset 0x170 readelf: .tmp_vmlinux.btf: Warning: type: 0x4, namesize: 0x006e6558, descsize: 0x00008801, alignment: 8 In this case readelf thinks there's alignment padding where there is none, so it starts reading an ELF note in the middle. With newer toolchains (e.g., latest Fedora Rawhide), a similar mismatch triggers a build failure when combined with CONFIG_X86_KERNEL_IBT: btf_encoder__encode: btf__dedup failed! Failed to encode BTF libbpf: failed to find '.BTF' ELF section in vmlinux FAILED: load BTF from vmlinux: No data available make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 255 This latter error was caused by pahole crashing when it encountered the corrupt .notes section. This crash has been fixed in dwarves version 1.25. As Tianyi Liu describes: "Pahole reads .notes to look for LINUX_ELFNOTE_BUILD_LTO. When LTO is enabled, pahole needs to call cus__merge_and_process_cu to merge compile units, at which point there should only be one unspecified type (used to represent some compilation information) in the global context. However, when the kernel is compiled without LTO, if pahole calls cus__merge_and_process_cu due to alignment issues with notes, multiple unspecified types may appear after merging the cus, and older versions of pahole only support up to one. This is why pahole 1.24 crashes, while newer versions support multiple. However, the latest version of pahole still does not solve the problem of incorrect LTO recognition, so compiling the kernel may be slower than normal." Even with the newer pahole, the note section misaligment issue still exists and pahole is misinterpreting the LTO note. Fix it by discarding the .note.gnu.property section. While GNU properties are important for user space (and VDSO), they don't seem to have any use for vmlinux. (In fact, they're already getting (inadvertently) stripped from vmlinux when CONFIG_DEBUG_INFO_BTF is enabled. The BTF data is extracted from vmlinux.o with "objcopy --only-section=.BTF" into .btf.vmlinux.bin.o. That file doesn't have .note.gnu.property, so when it gets modified and linked back into the main object, the linker automatically strips it (see "How GNU properties are merged" in the ld man page).) Reported-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lkml.kernel.org/bpf/57830c30-cd77-40cf-9cd1-3bb608aa602e@app.fastmail.com Debugged-by: Tianyi Liu <i.pear@outlook.com> Suggested-by: Joan Bruguera Micó <joanbrugueram@gmail.com> Link: https://lore.kernel.org/r/20230418214925.ay3jpf2zhw75kgmd@treble Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-04-18 21:49:25 +00:00
*
* Also, discard .note.gnu.property, otherwise it forces the notes section to
* be 8-byte aligned which causes alignment mismatches with the kernel's custom
* 4-byte aligned notes.
*/
#define NOTES \
vmlinux.lds.h: Discard .note.gnu.property section When tooling reads ELF notes, it assumes each note entry is aligned to the value listed in the .note section header's sh_addralign field. The kernel-created ELF notes in the .note.Linux and .note.Xen sections are aligned to 4 bytes. This causes the toolchain to set those sections' sh_addralign values to 4. On the other hand, the GCC-created .note.gnu.property section has an sh_addralign value of 8 for some reason, despite being based on struct Elf32_Nhdr which only needs 4-byte alignment. When the mismatched input sections get linked together into the vmlinux .notes output section, the higher alignment "wins", resulting in an sh_addralign of 8, which confuses tooling. For example: $ readelf -n .tmp_vmlinux.btf ... readelf: .tmp_vmlinux.btf: Warning: note with invalid namesz and/or descsz found at offset 0x170 readelf: .tmp_vmlinux.btf: Warning: type: 0x4, namesize: 0x006e6558, descsize: 0x00008801, alignment: 8 In this case readelf thinks there's alignment padding where there is none, so it starts reading an ELF note in the middle. With newer toolchains (e.g., latest Fedora Rawhide), a similar mismatch triggers a build failure when combined with CONFIG_X86_KERNEL_IBT: btf_encoder__encode: btf__dedup failed! Failed to encode BTF libbpf: failed to find '.BTF' ELF section in vmlinux FAILED: load BTF from vmlinux: No data available make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 255 This latter error was caused by pahole crashing when it encountered the corrupt .notes section. This crash has been fixed in dwarves version 1.25. As Tianyi Liu describes: "Pahole reads .notes to look for LINUX_ELFNOTE_BUILD_LTO. When LTO is enabled, pahole needs to call cus__merge_and_process_cu to merge compile units, at which point there should only be one unspecified type (used to represent some compilation information) in the global context. However, when the kernel is compiled without LTO, if pahole calls cus__merge_and_process_cu due to alignment issues with notes, multiple unspecified types may appear after merging the cus, and older versions of pahole only support up to one. This is why pahole 1.24 crashes, while newer versions support multiple. However, the latest version of pahole still does not solve the problem of incorrect LTO recognition, so compiling the kernel may be slower than normal." Even with the newer pahole, the note section misaligment issue still exists and pahole is misinterpreting the LTO note. Fix it by discarding the .note.gnu.property section. While GNU properties are important for user space (and VDSO), they don't seem to have any use for vmlinux. (In fact, they're already getting (inadvertently) stripped from vmlinux when CONFIG_DEBUG_INFO_BTF is enabled. The BTF data is extracted from vmlinux.o with "objcopy --only-section=.BTF" into .btf.vmlinux.bin.o. That file doesn't have .note.gnu.property, so when it gets modified and linked back into the main object, the linker automatically strips it (see "How GNU properties are merged" in the ld man page).) Reported-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lkml.kernel.org/bpf/57830c30-cd77-40cf-9cd1-3bb608aa602e@app.fastmail.com Debugged-by: Tianyi Liu <i.pear@outlook.com> Suggested-by: Joan Bruguera Micó <joanbrugueram@gmail.com> Link: https://lore.kernel.org/r/20230418214925.ay3jpf2zhw75kgmd@treble Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-04-18 21:49:25 +00:00
/DISCARD/ : { \
*(.note.GNU-stack) \
*(.note.gnu.property) \
} \
.notes : AT(ADDR(.notes) - LOAD_OFFSET) { \
BOUNDED_SECTION_BY(.note.*, _notes) \
} NOTES_HEADERS \
NOTES_HEADERS_RESTORE
#define INIT_SETUP(initsetup_align) \
. = ALIGN(initsetup_align); \
BOUNDED_SECTION_POST_LABEL(.init.setup, __setup, _start, _end)
#define INIT_CALLS_LEVEL(level) \
__initcall##level##_start = .; \
KEEP(*(.initcall##level##.init)) \
KEEP(*(.initcall##level##s.init)) \
#define INIT_CALLS \
__initcall_start = .; \
KEEP(*(.initcallearly.init)) \
INIT_CALLS_LEVEL(0) \
INIT_CALLS_LEVEL(1) \
INIT_CALLS_LEVEL(2) \
INIT_CALLS_LEVEL(3) \
INIT_CALLS_LEVEL(4) \
INIT_CALLS_LEVEL(5) \
INIT_CALLS_LEVEL(rootfs) \
INIT_CALLS_LEVEL(6) \
INIT_CALLS_LEVEL(7) \
__initcall_end = .;
#define CON_INITCALL \
BOUNDED_SECTION_POST_LABEL(.con_initcall.init, __con_initcall, _start, _end)
#define NAMED_SECTION(name) \
. = ALIGN(8); \
name : AT(ADDR(name) - LOAD_OFFSET) \
{ BOUNDED_SECTION_PRE_LABEL(name, name, __start_, __stop_) }
#define RUNTIME_CONST(t,x) NAMED_SECTION(runtime_##t##_##x)
#define RUNTIME_CONST_VARIABLES \
RUNTIME_CONST(shift, d_hash_shift) \
RUNTIME_CONST(ptr, dentry_hashtable)
/* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */
#define KUNIT_TABLE() \
. = ALIGN(8); \
BOUNDED_SECTION_POST_LABEL(.kunit_test_suites, __kunit_suites, _start, _end)
/* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */
#define KUNIT_INIT_TABLE() \
. = ALIGN(8); \
BOUNDED_SECTION_POST_LABEL(.kunit_init_test_suites, \
__kunit_init_suites, _start, _end)
#ifdef CONFIG_BLK_DEV_INITRD
#define INIT_RAM_FS \
. = ALIGN(4); \
__initramfs_start = .; \
KEEP(*(.init.ramfs)) \
initramfs: fix initramfs size calculation The size of a built-in initramfs is calculated in init/initramfs.c by "__initramfs_end - __initramfs_start". Those symbols are defined in the linker script include/asm-generic/vmlinux.lds.h: #define INIT_RAM_FS \ . = ALIGN(PAGE_SIZE); \ VMLINUX_SYMBOL(__initramfs_start) = .; \ *(.init.ramfs) \ VMLINUX_SYMBOL(__initramfs_end) = .; If the initramfs file has an odd number of bytes, the "__initramfs_end" symbol points to an odd address, for example, the symbols in the System.map might look like: 0000000000572000 T __initramfs_start 00000000005bcd05 T __initramfs_end <-- odd address At least on s390 this causes a problem: Certain s390 instructions, especially instructions for loading addresses (larl) or branch addresses must be on even addresses. The compiler loads the symbol addresses with the "larl" instruction. This instruction sets the last bit to 0 and, therefore, for odd size files, the calculated size is one byte less than it should be: 0000000000540a9c <populate_rootfs>: 540a9c: eb cf f0 78 00 24 stmg %r12,%r15,120(%r15), 540aa2: c0 10 00 01 8a af larl %r1,572000 <__initramfs_start> 540aa8: c0 c0 00 03 e1 2e larl %r12,5bcd04 <initramfs_end> (Instead of 5bcd05) ... 540abe: 1b c1 sr %r12,%r1 To fix the problem, this patch introduces the global variable __initramfs_size, which is calculated in the "usr/initramfs_data.S" file. The populate_rootfs() function can then use the start marker of the .init.ramfs section and the value of __initramfs_size for loading the initramfs. Because the start marker and size is sufficient, the __initramfs_end symbol is no longer needed and is removed. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Michal Marek <mmarek@suse.cz> Acked-by: "H. Peter Anvin" <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Marek <mmarek@suse.cz>
2010-09-17 22:24:11 +00:00
. = ALIGN(8); \
KEEP(*(.init.ramfs.info))
#else
#define INIT_RAM_FS
#endif
/*
* Memory encryption operates on a page basis. Since we need to clear
* the memory encryption mask for this section, it needs to be aligned
* on a page boundary and be a page-size multiple in length.
*
* Note: We use a separate section so that only this section gets
* decrypted to avoid exposing more than we wish.
*/
#ifdef CONFIG_AMD_MEM_ENCRYPT
#define PERCPU_DECRYPTED_SECTION \
. = ALIGN(PAGE_SIZE); \
*(.data..percpu..decrypted) \
. = ALIGN(PAGE_SIZE);
#else
#define PERCPU_DECRYPTED_SECTION
#endif
/*
* Default discarded sections.
*
* Some archs want to discard exit text/data at runtime rather than
* link time due to cross-section references such as alt instructions,
* bug table, eh_frame, etc. DISCARDS must be the last of output
* section definitions so that such archs put those in earlier section
* definitions.
*/
#ifdef RUNTIME_DISCARD_EXIT
#define EXIT_DISCARDS
#else
#define EXIT_DISCARDS \
EXIT_TEXT \
EXIT_DATA
#endif
/*
vmlinux.lds.h: Define SANTIZER_DISCARDS with CONFIG_GCOV_KERNEL=y clang produces .eh_frame sections when CONFIG_GCOV_KERNEL is enabled, even when -fno-asynchronous-unwind-tables is in KBUILD_CFLAGS: $ make CC=clang vmlinux ... ld: warning: orphan section `.eh_frame' from `init/main.o' being placed in section `.eh_frame' ld: warning: orphan section `.eh_frame' from `init/version.o' being placed in section `.eh_frame' ld: warning: orphan section `.eh_frame' from `init/do_mounts.o' being placed in section `.eh_frame' ld: warning: orphan section `.eh_frame' from `init/do_mounts_initrd.o' being placed in section `.eh_frame' ld: warning: orphan section `.eh_frame' from `init/initramfs.o' being placed in section `.eh_frame' ld: warning: orphan section `.eh_frame' from `init/calibrate.o' being placed in section `.eh_frame' ld: warning: orphan section `.eh_frame' from `init/init_task.o' being placed in section `.eh_frame' ... $ rg "GCOV_KERNEL|GCOV_PROFILE_ALL" .config CONFIG_GCOV_KERNEL=y CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y CONFIG_GCOV_PROFILE_ALL=y This was already handled for a couple of other options in commit d812db78288d ("vmlinux.lds.h: Avoid KASAN and KCSAN's unwanted sections") and there is an open LLVM bug for this issue. Take advantage of that section for this config as well so that there are no more orphan warnings. Link: https://bugs.llvm.org/show_bug.cgi?id=46478 Link: https://github.com/ClangBuiltLinux/linux/issues/1069 Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Fangrui Song <maskray@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Fixes: d812db78288d ("vmlinux.lds.h: Avoid KASAN and KCSAN's unwanted sections") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210130004650.2682422-1-nathan@kernel.org
2021-01-30 00:46:51 +00:00
* Clang's -fprofile-arcs, -fsanitize=kernel-address, and
* -fsanitize=thread produce unwanted sections (.eh_frame
* and .init_array.*), but CONFIG_CONSTRUCTORS wants to
* keep any .init_array.* sections.
* https://llvm.org/pr46478
*/
#ifdef CONFIG_UNWIND_TABLES
#define DISCARD_EH_FRAME
#else
#define DISCARD_EH_FRAME *(.eh_frame)
#endif
#if defined(CONFIG_GCOV_KERNEL) || defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KCSAN)
# ifdef CONFIG_CONSTRUCTORS
# define SANITIZER_DISCARDS \
DISCARD_EH_FRAME
# else
# define SANITIZER_DISCARDS \
*(.init_array) *(.init_array.*) \
DISCARD_EH_FRAME
# endif
#else
# define SANITIZER_DISCARDS
#endif
#define COMMON_DISCARDS \
SANITIZER_DISCARDS \
PATCHABLE_DISCARDS \
*(.discard) \
*(.discard.*) \
kbuild: generate KSYMTAB entries by modpost Commit 7b4537199a4a ("kbuild: link symbol CRCs at final link, removing CONFIG_MODULE_REL_CRCS") made modpost output CRCs in the same way whether the EXPORT_SYMBOL() is placed in *.c or *.S. For further cleanups, this commit applies a similar approach to the entire data structure of EXPORT_SYMBOL(). The EXPORT_SYMBOL() compilation is split into two stages. When a source file is compiled, EXPORT_SYMBOL() will be converted into a dummy symbol in the .export_symbol section. For example, EXPORT_SYMBOL(foo); EXPORT_SYMBOL_NS_GPL(bar, BAR_NAMESPACE); will be encoded into the following assembly code: .section ".export_symbol","a" __export_symbol_foo: .asciz "" /* license */ .asciz "" /* name space */ .balign 8 .quad foo /* symbol reference */ .previous .section ".export_symbol","a" __export_symbol_bar: .asciz "GPL" /* license */ .asciz "BAR_NAMESPACE" /* name space */ .balign 8 .quad bar /* symbol reference */ .previous They are mere markers to tell modpost the name, license, and namespace of the symbols. They will be dropped from the final vmlinux and modules because the *(.export_symbol) will go into /DISCARD/ in the linker script. Then, modpost extracts all the information about EXPORT_SYMBOL() from the .export_symbol section, and generates the final C code: KSYMTAB_FUNC(foo, "", ""); KSYMTAB_FUNC(bar, "_gpl", "BAR_NAMESPACE"); KSYMTAB_FUNC() (or KSYMTAB_DATA() if it is data) is expanded to struct kernel_symbol that will be linked to the vmlinux or a module. With this change, EXPORT_SYMBOL() works in the same way for *.c and *.S files, providing the following benefits. [1] Deprecate EXPORT_DATA_SYMBOL() In the old days, EXPORT_SYMBOL() was only available in C files. To export a symbol in *.S, EXPORT_SYMBOL() was placed in a separate *.c file. arch/arm/kernel/armksyms.c is one example written in the classic manner. Commit 22823ab419d8 ("EXPORT_SYMBOL() for asm") removed this limitation. Since then, EXPORT_SYMBOL() can be placed close to the symbol definition in *.S files. It was a nice improvement. However, as that commit mentioned, you need to use EXPORT_DATA_SYMBOL() for data objects on some architectures. In the new approach, modpost checks symbol's type (STT_FUNC or not), and outputs KSYMTAB_FUNC() or KSYMTAB_DATA() accordingly. There are only two users of EXPORT_DATA_SYMBOL: EXPORT_DATA_SYMBOL_GPL(empty_zero_page) (arch/ia64/kernel/head.S) EXPORT_DATA_SYMBOL(ia64_ivt) (arch/ia64/kernel/ivt.S) They are transformed as follows and output into .vmlinux.export.c KSYMTAB_DATA(empty_zero_page, "_gpl", ""); KSYMTAB_DATA(ia64_ivt, "", ""); The other EXPORT_SYMBOL users in ia64 assembly are output as KSYMTAB_FUNC(). EXPORT_DATA_SYMBOL() is now deprecated. [2] merge <linux/export.h> and <asm-generic/export.h> There are two similar header implementations: include/linux/export.h for .c files include/asm-generic/export.h for .S files Ideally, the functionality should be consistent between them, but they tend to diverge. Commit 8651ec01daed ("module: add support for symbol namespaces.") did not support the namespace for *.S files. This commit shifts the essential implementation part to C, which supports EXPORT_SYMBOL_NS() for *.S files. <asm/export.h> and <asm-generic/export.h> will remain as a wrapper of <linux/export.h> for a while. They will be removed after #include <asm/export.h> directives are all replaced with #include <linux/export.h>. [3] Implement CONFIG_TRIM_UNUSED_KSYMS in one-pass algorithm (by a later commit) When CONFIG_TRIM_UNUSED_KSYMS is enabled, Kbuild recursively traverses the directory tree to determine which EXPORT_SYMBOL to trim. If an EXPORT_SYMBOL turns out to be unused by anyone, Kbuild begins the second traverse, where some source files are recompiled with their EXPORT_SYMBOL() tuned into a no-op. We can do this better now; modpost can selectively emit KSYMTAB entries that are really used by modules. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2023-06-11 15:50:52 +00:00
*(.export_symbol) \
*(.modinfo) \
/* ld.bfd warns about .gnu.version* even when not emitted */ \
*(.gnu.version*) \
2009-06-24 06:13:38 +00:00
#define DISCARDS \
/DISCARD/ : { \
EXIT_DISCARDS \
EXIT_CALL \
COMMON_DISCARDS \
2009-06-24 06:13:38 +00:00
}
/**
* PERCPU_INPUT - the percpu input sections
* @cacheline: cacheline size
*
* The core percpu section names and core symbols which do not rely
* directly upon load addresses.
*
* @cacheline is used to align subsections to avoid false cacheline
* sharing between subsections for different purposes.
*/
#define PERCPU_INPUT(cacheline) \
__per_cpu_start = .; \
*(.data..percpu..first) \
. = ALIGN(PAGE_SIZE); \
*(.data..percpu..page_aligned) \
. = ALIGN(cacheline); \
*(.data..percpu..read_mostly) \
. = ALIGN(cacheline); \
*(.data..percpu) \
*(.data..percpu..shared_aligned) \
PERCPU_DECRYPTED_SECTION \
__per_cpu_end = .;
/**
* PERCPU_VADDR - define output section for percpu area
* @cacheline: cacheline size
* @vaddr: explicit base address (optional)
* @phdr: destination PHDR (optional)
*
* Macro which expands to output section for percpu area.
*
* @cacheline is used to align subsections to avoid false cacheline
* sharing between subsections for different purposes.
*
* If @vaddr is not blank, it specifies explicit base address and all
* percpu symbols will be offset from the given address. If blank,
* @vaddr always equals @laddr + LOAD_OFFSET.
*
* @phdr defines the output PHDR to use if not blank. Be warned that
* output PHDR is sticky. If @phdr is specified, the next output
* section in the linker script will go there too. @phdr should have
* a leading colon.
*
* Note that this macros defines __per_cpu_load as an absolute symbol.
* If there is no need to put the percpu section at a predetermined
* address, use PERCPU_SECTION.
*/
#define PERCPU_VADDR(cacheline, vaddr, phdr) \
__per_cpu_load = .; \
.data..percpu vaddr : AT(__per_cpu_load - LOAD_OFFSET) { \
PERCPU_INPUT(cacheline) \
} phdr \
. = __per_cpu_load + SIZEOF(.data..percpu);
/**
* PERCPU_SECTION - define output section for percpu area, simple version
* @cacheline: cacheline size
*
* Align to PAGE_SIZE and outputs output section for percpu area. This
* macro doesn't manipulate @vaddr or @phdr and __per_cpu_load and
* __per_cpu_start will be identical.
*
* This macro is equivalent to ALIGN(PAGE_SIZE); PERCPU_VADDR(@cacheline,,)
* except that __per_cpu_load is defined as a relative symbol against
* .data..percpu which is required for relocatable x86_32 configuration.
*/
#define PERCPU_SECTION(cacheline) \
. = ALIGN(PAGE_SIZE); \
.data..percpu : AT(ADDR(.data..percpu) - LOAD_OFFSET) { \
__per_cpu_load = .; \
PERCPU_INPUT(cacheline) \
}
/*
* Definition of the high level *_SECTION macros
* They will fit only a subset of the architectures
*/
/*
* Writeable data.
* All sections are combined in a single .data section.
* The sections following CONSTRUCTORS are arranged so their
* typical alignment matches.
* A cacheline is typical/always less than a PAGE_SIZE so
* the sections that has this restriction (or similar)
* is located before the ones requiring PAGE_SIZE alignment.
* NOSAVE_DATA starts and ends with a PAGE_SIZE alignment which
* matches the requirement of PAGE_ALIGNED_DATA.
*
* use 0 as page_align if page_aligned data is not used */
#define RW_DATA(cacheline, pagealigned, inittask) \
. = ALIGN(PAGE_SIZE); \
.data : AT(ADDR(.data) - LOAD_OFFSET) { \
INIT_TASK_DATA(inittask) \
NOSAVE_DATA \
PAGE_ALIGNED_DATA(pagealigned) \
CACHELINE_ALIGNED_DATA(cacheline) \
READ_MOSTLY_DATA(cacheline) \
DATA_DATA \
CONSTRUCTORS \
} \
BUG_TABLE \
#define INIT_TEXT_SECTION(inittext_align) \
. = ALIGN(inittext_align); \
.init.text : AT(ADDR(.init.text) - LOAD_OFFSET) { \
_sinittext = .; \
INIT_TEXT \
_einittext = .; \
}
#define INIT_DATA_SECTION(initsetup_align) \
.init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { \
INIT_DATA \
INIT_SETUP(initsetup_align) \
INIT_CALLS \
CON_INITCALL \
INIT_RAM_FS \
}
vmlinux.lds.h: restructure BSS linker script macros The BSS section macros in vmlinux.lds.h currently place the .sbss input section outside the bounds of [__bss_start, __bss_end]. On all architectures except for microblaze that handle both .sbss and __bss_start/__bss_end, this is wrong: the .sbss input section is within the range [__bss_start, __bss_end]. Relatedly, the example code at the top of the file actually has __bss_start/__bss_end defined twice; I believe the right fix here is to define them in the BSS_SECTION macro but not in the BSS macro. Another problem with the current macros is that several architectures have an ALIGN(4) or some other small number just before __bss_stop in their linker scripts. The BSS_SECTION macro currently hardcodes this to 4; while it should really be an argument. It also ignores its sbss_align argument; fix that. mn10300 is the only user at present of any of the macros touched by this patch. It looks like mn10300 actually was incorrectly converted to use the new BSS() macro (the alignment of 4 prior to conversion was a __bss_stop alignment, but the argument to the BSS macro is a start alignment). So fix this as well. I'd like acks from Sam and David on this one. Also CCing Paul, since he has a patch from me which will need to be updated to use BSS_SECTION(0, PAGE_SIZE, 4) once this gets merged. Signed-off-by: Tim Abbott <tabbott@ksplice.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2009-07-12 22:23:33 +00:00
#define BSS_SECTION(sbss_align, bss_align, stop_align) \
. = ALIGN(sbss_align); \
__bss_start = .; \
vmlinux.lds.h: restructure BSS linker script macros The BSS section macros in vmlinux.lds.h currently place the .sbss input section outside the bounds of [__bss_start, __bss_end]. On all architectures except for microblaze that handle both .sbss and __bss_start/__bss_end, this is wrong: the .sbss input section is within the range [__bss_start, __bss_end]. Relatedly, the example code at the top of the file actually has __bss_start/__bss_end defined twice; I believe the right fix here is to define them in the BSS_SECTION macro but not in the BSS macro. Another problem with the current macros is that several architectures have an ALIGN(4) or some other small number just before __bss_stop in their linker scripts. The BSS_SECTION macro currently hardcodes this to 4; while it should really be an argument. It also ignores its sbss_align argument; fix that. mn10300 is the only user at present of any of the macros touched by this patch. It looks like mn10300 actually was incorrectly converted to use the new BSS() macro (the alignment of 4 prior to conversion was a __bss_stop alignment, but the argument to the BSS macro is a start alignment). So fix this as well. I'd like acks from Sam and David on this one. Also CCing Paul, since he has a patch from me which will need to be updated to use BSS_SECTION(0, PAGE_SIZE, 4) once this gets merged. Signed-off-by: Tim Abbott <tabbott@ksplice.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2009-07-12 22:23:33 +00:00
SBSS(sbss_align) \
BSS(bss_align) \
vmlinux.lds.h: restructure BSS linker script macros The BSS section macros in vmlinux.lds.h currently place the .sbss input section outside the bounds of [__bss_start, __bss_end]. On all architectures except for microblaze that handle both .sbss and __bss_start/__bss_end, this is wrong: the .sbss input section is within the range [__bss_start, __bss_end]. Relatedly, the example code at the top of the file actually has __bss_start/__bss_end defined twice; I believe the right fix here is to define them in the BSS_SECTION macro but not in the BSS macro. Another problem with the current macros is that several architectures have an ALIGN(4) or some other small number just before __bss_stop in their linker scripts. The BSS_SECTION macro currently hardcodes this to 4; while it should really be an argument. It also ignores its sbss_align argument; fix that. mn10300 is the only user at present of any of the macros touched by this patch. It looks like mn10300 actually was incorrectly converted to use the new BSS() macro (the alignment of 4 prior to conversion was a __bss_stop alignment, but the argument to the BSS macro is a start alignment). So fix this as well. I'd like acks from Sam and David on this one. Also CCing Paul, since he has a patch from me which will need to be updated to use BSS_SECTION(0, PAGE_SIZE, 4) once this gets merged. Signed-off-by: Tim Abbott <tabbott@ksplice.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2009-07-12 22:23:33 +00:00
. = ALIGN(stop_align); \
__bss_stop = .;