linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-12-28 16:56:26 +00:00

Linux kernel stable tree

Lance Yang 03ecb24db2 hung_task: add detect count for hung tasks	2024-11-11 17:17:03 -08:00

Patch series "add detect count for hung tasks", v2.

This patchset adds a counter, hung_task_detect_count, to track the number of times hung tasks are detected.

IMHO, hung tasks are a critical metric. Currently, we detect them by periodically parsing dmesg. However, this method isn't as user-friendly as using a counter.

Sometimes, a short-lived issue with NIC or hard drive can quickly decrease the hung_task_warnings to zero. Without warnings, we must directly access the node to ensure that there are no more hung tasks and that the system has recovered. After all, load average alone cannot provide a clear picture.

Once this counter is in place, in a high-density deployment pattern, we plan to set hung_task_timeout_secs to a lower number to improve stability, even though this might result in false positives. And then we can set a time-based threshold: if hung tasks last beyond this duration, we will automatically migrate containers to other nodes. Based on past experience, this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have counters, having a dedicated counter for hung tasks makes sense ;)

This patch (of 2):

This commit adds a counter, hung_task_detect_count, to track the number of times hung tasks are detected.

IMHO, hung tasks are a critical metric. Currently, we detect them by periodically parsing dmesg. However, this method isn't as user-friendly as using a counter.

Sometimes, a short-lived issue with NIC or hard drive can quickly decrease the hung_task_warnings to zero. Without warnings, we must directly access the node to ensure that there are no more hung tasks and that the system has recovered. After all, load average alone cannot provide a clear picture.

Once this counter is in place, in a high-density deployment pattern, we plan to set hung_task_timeout_secs to a lower number to improve stability, even though this might result in false positives. And then we can set a time-based threshold: if hung tasks last beyond this duration, we will automatically migrate containers to other nodes. Based on past experience, this approach could help avoid many production disruptions.

Moreover, just like other important events such as OOM that already have counters, having a dedicated counter for hung tasks makes sense.

[ioworker0@gmail.com: proc_doulongvec_minmax instead of proc_dointvec]
  Link: https://lkml.kernel.org/r/20241101114833.8377-1-ioworker0@gmail.com
Link: https://lkml.kernel.org/r/20241027120747.42833-1-ioworker0@gmail.com
Link: https://lkml.kernel.org/r/20241027120747.42833-2-ioworker0@gmail.com
Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Cun <cunhuang@tencent.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Siddle <jsiddle@redhat.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Yongliang Gao <leonylgao@tencent.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
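
The monitoring idea in the commit message above (poll a counter instead of parsing dmesg, then act once hung tasks persist past a time threshold) could be sketched roughly as follows. This is an illustrative sketch only, not part of the patch: the sysctl path is assumed to be /proc/sys/kernel/hung_task_detect_count, the helper names (read_detect_count, hung_duration, should_migrate) are hypothetical, and "hung tasks last beyond this duration" is interpreted here as "the counter kept increasing across consecutive polls".

```python
from pathlib import Path

# Assumed location of the counter added by this commit (kernel.hung_task_detect_count).
COUNTER_PATH = Path("/proc/sys/kernel/hung_task_detect_count")


def read_detect_count(path=COUNTER_PATH):
    """Return the cumulative number of hung-task detections (hypothetical helper)."""
    return int(path.read_text().strip())


def hung_duration(samples):
    """Return how long the counter has been rising without a quiet interval.

    `samples` is a time-ordered list of (timestamp_secs, count) pairs.
    The result is measured up to the last sample and is 0 if the most
    recent polling interval saw no new detections.
    """
    start = None
    for (t_prev, c_prev), (_t_cur, c_cur) in zip(samples, samples[1:]):
        if c_cur > c_prev:
            # New detections in this interval: open a streak if none is open.
            if start is None:
                start = t_prev
        else:
            # Quiet interval: treat the system as recovered.
            start = None
    if start is None:
        return 0
    return samples[-1][0] - start


def should_migrate(samples, threshold_secs):
    """Hypothetical policy: migrate once detections persist for threshold_secs."""
    return hung_duration(samples) >= threshold_secs
```

A caller would append (time.time(), read_detect_count()) to a sample list at each poll and check should_migrate(samples, threshold_secs); the polling interval and threshold are deployment choices that the commit message does not specify.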
arch	A trivial compile test fix for x86	2024-11-03 08:26:00 -10:00
block	block-6.12-20241101	2024-11-01 13:41:55 -10:00
certs	sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3	2024-09-20 19:52:48 +03:00
crypto	This push fixes the following issues:	2024-10-16 08:42:54 -07:00
Documentation	Documentation/core-api: add min heap API introduction	2024-11-05 17:12:37 -08:00
drivers	dma-buf: use atomic64_inc_return() in dma_buf_getfile()	2024-11-06 13:36:37 -08:00
fs	fs/proc/kcore.c: fix coccinelle reported ERROR instances	2024-11-06 13:36:37 -08:00
include	lib min_heap: avoid indirect function call by providing default swap	2024-11-05 17:12:35 -08:00
init	lib/Makefile: make union-find compilation conditional on CONFIG_CPUSETS	2024-11-05 17:12:32 -08:00
io_uring	io_uring/rw: fix missing NOWAIT check for O_DIRECT start write	2024-10-31 08:21:02 -06:00
ipc	ipc: fix memleak if msg_init_ns failed in create_ipc_ns	2024-11-05 17:12:33 -08:00
kernel	hung_task: add detect count for hung tasks	2024-11-11 17:17:03 -08:00
lib	lib/scatterlist: use sg_phys() helper	2024-11-05 17:12:40 -08:00
LICENSES	LICENSES: add 0BSD license text	2024-09-01 20:43:24 -07:00
mm	mm/util: deduplicate code in {kstrdup,kstrndup,kmemdup_nul}	2024-11-05 17:12:30 -08:00
net	nfsd-6.12 fixes:	2024-11-02 09:27:11 -10:00
rust	Driver core fix for 6.12-rc3	2024-10-13 09:10:52 -07:00
samples	perf/hw_breakpoint: use ERR_PTR_PCPU(), IS_ERR_PCPU() and PTR_ERR_PCPU() macros	2024-11-05 17:12:32 -08:00
scripts	checkpatch: always parse orig_commit in fixes tag	2024-11-05 17:12:40 -08:00
security	security: replace memcpy() with get_task_comm()	2024-11-05 17:12:29 -08:00
sound	ALSA: hda/realtek: Fix headset mic on TUXEDO Stellaris 16 Gen6 mb1	2024-10-30 14:46:59 +01:00
tools	perf tools: update expected diff for lib/list_sort.c	2024-11-05 17:12:33 -08:00
usr	initramfs: shorten cmd_initfs in usr/Makefile	2024-07-16 01:07:52 +09:00
virt	ARM64:	2024-10-21 11:22:04 -07:00
.clang-format	clang-format: Update with v6.11-rc1's `for_each` macro list	2024-08-02 13:20:31 +02:00
.cocciconfig	scripts: add Linux .cocciconfig for coccinelle	2016-07-22 12:13:39 +02:00
.editorconfig	.editorconfig: remove trim_trailing_whitespace option	2024-06-13 16:47:52 +02:00
.get_maintainer.ignore	Add Jeff Kirsher to .get_maintainer.ignore	2024-03-08 11:36:54 +00:00
.gitattributes	.gitattributes: set diff driver for Rust source code files	2023-05-31 17:48:25 +02:00
.gitignore	Kbuild updates for v6.12	2024-09-24 13:02:06 -07:00
.mailmap	.mailmap: update e-mail address for Eugen Hristev	2024-10-31 20:27:04 -07:00
.rustfmt.toml	rust: add `.rustfmt.toml`	2022-09-28 09:02:20 +02:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	CREDITS: sort alphabetically by name	2024-10-09 12:47:19 -07:00
Kbuild	Kbuild updates for v6.1	2022-10-10 12:00:45 -07:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	MAINTAINERS: add entry for min heap library code	2024-11-05 17:12:37 -08:00
Makefile	Linux 6.12-rc6	2024-11-03 14:05:52 -10:00
README	README: Fix spelling	2024-03-18 03:36:32 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.